GigaSpeech 2 raw
Large-scale, multi-domain, multilingual ASR corpus with roughly 28,300 hours of auto-transcribed speech covering Thai, Indonesian, and Vietnamese
Duration
28338.7 hours
Languages
3
Sample Rate
16 kHz
Published
2024-06
Description
1. Sourced from unlabeled YouTube videos through an automated crawling and transcription pipeline
2. Contains approximately 28,339 hours of auto-transcribed speech data covering Thai, Indonesian, and Vietnamese
3. Spans multiple domains including agriculture, arts, business, climate, culture, economics, education, entertainment, health, history, literature, music, politics, relationships, shopping, society, sports, technology, and travel
4. Content formats include audiobooks, commentary, lectures, monologues, movies, news, talks, and vlogs
5. Transcribed using Whisper large-v3 with TorchAudio for forced alignment
6. Quality ensured through multi-dimensional filtering rules (character set filtering, language confidence filtering, audio duration filtering, balanced filtering)
7. Audio converted to mono WAV format at 16 kHz sample rate
8. DEV and TEST sets each contain 10 hours of professionally human-transcribed data with no speaker overlap
9. Released under Creative Commons license for non-commercial research and educational use only
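The filtering rules listed in the description can be illustrated with a minimal sketch. It is not the corpus's actual pipeline: the `Segment` fields, the threshold values, and the Thai-only character set are all assumptions chosen for illustration, and the balanced filtering step is omitted because its definition is not given here.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the values used to build the corpus are not stated here.
MIN_LANG_CONFIDENCE = 0.9
MIN_DURATION_S = 1.0
MAX_DURATION_S = 30.0


@dataclass
class Segment:
    """One auto-transcribed segment (field names are illustrative)."""
    text: str
    lang_confidence: float  # language-ID confidence from the transcriber
    duration_s: float       # segment length in seconds


def is_thai_char(ch: str) -> bool:
    """Accept a space or any code point in the Thai Unicode block (U+0E00..U+0E7F)."""
    return ch == " " or "\u0e00" <= ch <= "\u0e7f"


def keep_segment(seg: Segment) -> bool:
    """Apply character set, language confidence, and duration filters in turn."""
    # Character set filtering: reject empty text or any out-of-set character.
    if not seg.text or not all(is_thai_char(ch) for ch in seg.text):
        return False
    # Language confidence filtering.
    if seg.lang_confidence < MIN_LANG_CONFIDENCE:
        return False
    # Audio duration filtering.
    return MIN_DURATION_S <= seg.duration_s <= MAX_DURATION_S
```

For example, `keep_segment(Segment("สวัสดี", 0.95, 3.0))` passes all three checks, while the same segment with Latin text, a confidence of 0.5, or a duration of 45 seconds would be dropped. Indonesian and Vietnamese would need their own character sets.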
Language Details
| Language | Duration |
|---|---|
| Thai | 12901.8 hours |
| Indonesian | 8112.9 hours |
| Vietnamese | 7324.0 hours |
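As a quick consistency check, the per-language durations in the table can be summed and compared against the reported total of 28338.7 hours. The sketch below uses only the figures from the table; the variable and function names are illustrative.

```python
# Per-language durations in hours, copied from the table.
HOURS = {
    "Thai": 12901.8,
    "Indonesian": 8112.9,
    "Vietnamese": 7324.0,
}

# Round to one decimal place to avoid floating-point noise in the sum.
total_hours = round(sum(HOURS.values()), 1)


def share(language: str) -> float:
    """Return the fraction of the total duration contributed by one language."""
    return HOURS[language] / total_hours


for language in HOURS:
    print(f"{language}: {share(language):.1%} of {total_hours} hours")
```

The rounded sum is 28338.7 hours, which matches the Duration field.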
Publisher