Description

1. Derived from GigaSpeech 2 raw through iterative label refinement using an improved Noisy Student Training (NST) method

2. Contains approximately 22,015 hours of refined transcribed speech covering Thai, Indonesian, and Vietnamese

3. Teacher models are trained iteratively, and pseudo-labels are filtered and re-labeled based on character error rate (CER) to progressively improve transcription quality

4. ASR models trained on this dataset outperform Whisper large-v3 and commercial services on Thai across all benchmarks

5. These models achieve performance comparable to or better than Whisper large-v3 on Indonesian and Vietnamese with only one-tenth the parameters

6. Sourced from spontaneous speech on YouTube, with refined pseudo-labels

7. Audio is converted to mono WAV format at a 16 kHz sample rate
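
The CER-based filtering and re-labeling step mentioned in the description can be sketched roughly as follows. This is only an illustration and not the authors' pipeline: the helper names (`edit_distance`, `cer`, `filter_and_relabel`), the `teacher_transcribe` callback, the 0.1 threshold, and the rule "keep an utterance when the teacher's output is close to the existing label" are all assumptions made for the example.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def filter_and_relabel(samples, teacher_transcribe, max_cer=0.1):
    """Keep utterances whose teacher output agrees with the current label.

    samples: iterable of (utterance_id, current_pseudo_label) pairs.
    teacher_transcribe: hypothetical callable mapping an utterance id
        to the teacher model's transcript.
    max_cer: assumed threshold; the source does not state a value.
    """
    kept = []
    for utt_id, old_label in samples:
        new_label = teacher_transcribe(utt_id)
        if cer(old_label, new_label) <= max_cer:
            kept.append((utt_id, new_label))  # re-label with teacher output
    return kept
```

In an iterative setup, the kept utterances would be used to train the next model, which then acts as the teacher for another round of filtering.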

Language Details

Language	Duration (hours)
Thai	10,262
Vietnamese	6,039
Indonesian	5,714
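
As a quick consistency check, the per-language durations can be summed and compared with the total of roughly 22,015 hours stated in the description. The snippet below is a small sketch that uses only the figures from the table; the variable names are illustrative.

```python
# Per-language durations in hours, copied from the table above.
hours = {"Thai": 10262.0, "Vietnamese": 6039.0, "Indonesian": 5714.0}

# Total duration across the three languages.
total = sum(hours.values())

# Share of each language as a percentage of the total.
shares = {lang: 100 * h / total for lang, h in hours.items()}

print(f"Total: {total:.0f} hours")
for lang, pct in shares.items():
    print(f"{lang}: {pct:.1f}%")
```

The sum is 22,015 hours, which matches the stated total, with Thai accounting for a little under half of the data.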

Publisher

Shanghai Jiao Tong University
Peng Cheng Laboratory
The Chinese University of Hong Kong
Tsinghua University
Harbin Institute of Technology
Birch AI
SpeechOcean
AISpeech
Seasalt AI Inc
SpeechColab

Resources

arXiv: https://arxiv.org/abs/2406.11546
Hugging Face: https://huggingface.co/datasets/speechcolab/gigaspeech2
GitHub: https://github.com/SpeechColab/GigaSpeech2

GigaSpeech 2 refined