GigaSpeech 2 refined
High-quality multilingual ASR corpus with ~22,000 hours of iteratively label-refined speech for Thai, Indonesian, and Vietnamese
Duration
22015.0 hours
Languages
3
Sample Rate
16 kHz
Published
2024-06
Description
1. Derived from GigaSpeech 2 raw through iterative label refinement using an improved Noisy Student Training procedure
2. Contains approximately 22,015 hours of refined, transcribed speech covering Thai, Indonesian, and Vietnamese
3. Teacher models are trained iteratively, and pseudo-labels are filtered or re-labeled based on character error rate (CER) to progressively improve transcription quality
4. ASR models trained on this dataset outperform Whisper large-v3 and commercial services on Thai across all benchmarks
5. These models achieve performance comparable to or better than Whisper large-v3 on Indonesian and Vietnamese with only one-tenth the parameters
6. Sourced from spontaneous speech on YouTube, with refined pseudo-labels
7. Audio is converted to mono WAV format at a 16 kHz sample rate
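The CER-based filtering mentioned in the description can be illustrated with a minimal sketch. This is not the dataset authors' pipeline: the function names, the `(utt_id, current_label, teacher_hypothesis)` tuple layout, and the `max_cer=0.1` threshold are all assumptions chosen for illustration. The idea is to keep an utterance only when the teacher model's hypothesis agrees closely with the existing pseudo-label, and to carry the teacher's output forward as the new label.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def filter_by_cer(samples, max_cer=0.1):
    """Keep utterances whose teacher hypothesis is close to the current label.

    `samples` holds (utt_id, current_label, teacher_hypothesis) tuples.
    The threshold is a hypothetical value, not one taken from the dataset.
    """
    kept = []
    for utt_id, label, hyp in samples:
        if cer(label, hyp) <= max_cer:
            kept.append((utt_id, hyp))  # re-label with the teacher output
    return kept
```

In an iterative setup, the kept pairs would be used to train the next teacher, and the filter would be applied again to its output. A production pipeline would typically normalize text (casing, punctuation, script-specific segmentation) before computing CER.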
Language Details
| Language | Duration |
|---|---|
| Thai | 10262.0 hours |
| Vietnamese | 6039.0 hours |
| Indonesian | 5714.0 hours |
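As a quick consistency check, the per-language durations in the table can be summed and compared against the stated total of 22,015 hours. The snippet below is only a sketch of that arithmetic. The dictionary name and the share calculation are illustrative additions, not part of the dataset's tooling.

```python
# Per-language durations in hours, copied from the table above.
hours_by_language = {
    "Thai": 10262.0,
    "Vietnamese": 6039.0,
    "Indonesian": 5714.0,
}

# The total should match the headline duration of the corpus.
total_hours = sum(hours_by_language.values())

# Fraction of the corpus contributed by each language.
share_by_language = {
    language: hours / total_hours
    for language, hours in hours_by_language.items()
}

for language, share in share_by_language.items():
    print(f"{language}: {share:.1%} of {total_hours:.0f} hours")
```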
Publisher