GigaSpeech 2 raw

Name: GigaSpeech 2 raw
Creator: Shanghai Jiao Tong University, Peng Cheng Laboratory, The Chinese University of Hong Kong, Tsinghua University, Harbin Institute of Technology, Birch AI, SpeechOcean, AISpeech, Seasalt AI Inc, SpeechColab
Published: 2024-06

Large-scale multi-domain multilingual ASR corpus with ~30,000 hours of auto-transcribed speech covering Thai, Indonesian, and Vietnamese

Duration

28338.7 hours

Languages

Thai, Indonesian, Vietnamese

Sample Rate

16 kHz

Published

2024-06

Description

1. Sourced from unlabeled YouTube videos through an automated crawling and transcription pipeline
2. Contains approximately 28,339 hours of auto-transcribed speech data covering Thai, Indonesian, and Vietnamese
3. Spans multiple domains including agriculture, arts, business, climate, culture, economics, education, entertainment, health, history, literature, music, politics, relationships, shopping, society, sports, technology, and travel
4. Content formats include audiobooks, commentary, lectures, monologues, movies, news, talks, and vlogs
5. Transcribed using Whisper large-v3, with TorchAudio used for forced alignment
6. Quality ensured through multi-dimensional filtering rules (character set filtering, language confidence filtering, audio duration filtering, balanced filtering)
7. Audio converted to mono WAV format at a 16 kHz sample rate
8. DEV and TEST sets each contain 10 hours of professionally human-transcribed data with no speaker overlap
9. Released under a Creative Commons license for non-commercial research and educational use only
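
The filtering rules named in the description can be illustrated with a minimal sketch. This is not the corpus authors' pipeline: the Segment class, the threshold values, and the Thai-only character check are all assumptions made for illustration, and balanced filtering is left out. It shows how character set, language confidence, and audio duration rules might be combined into one keep-or-drop decision per segment.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical record for one auto-transcribed audio segment."""
    text: str          # automatic transcript
    lang_conf: float   # language-ID confidence in [0, 1]
    duration_s: float  # segment length in seconds


# Hypothetical thresholds, chosen only for illustration.
MIN_LANG_CONF = 0.9
MIN_DURATION_S = 2.0
MAX_DURATION_S = 30.0


def is_thai_text(text: str) -> bool:
    """True if every non-space character lies in the Thai Unicode block (U+0E00 to U+0E7F)."""
    chars = [c for c in text if not c.isspace()]
    return bool(chars) and all("\u0e00" <= c <= "\u0e7f" for c in chars)


def keep_segment(seg: Segment) -> bool:
    """Apply character set, language confidence, and duration rules together."""
    return (
        is_thai_text(seg.text)
        and seg.lang_conf >= MIN_LANG_CONF
        and MIN_DURATION_S <= seg.duration_s <= MAX_DURATION_S
    )


segments = [
    Segment("สวัสดีครับ", 0.98, 3.2),   # passes all three rules
    Segment("hello world", 0.97, 3.0),  # fails the character set rule
    Segment("สวัสดี", 0.40, 3.0),        # fails the language confidence rule
    Segment("ครับ", 0.99, 0.5),          # fails the duration rule
]
kept = [s for s in segments if keep_segment(s)]
print(len(kept), "of", len(segments), "segments kept")
```

A real pipeline would need one character set per language and would likely tune the thresholds against the human-transcribed DEV data.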

Language Details

Language	Duration
Thai	12901.8 hours
Indonesian	8112.9 hours
Vietnamese	7324.0 hours
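
As a quick consistency check, the per-language durations in the table can be summed and compared with the stated total of 28338.7 hours. The snippet below is only a sketch of that arithmetic; the variable names are illustrative.

```python
# Per-language durations in hours, copied from the table above.
hours = {"Thai": 12901.8, "Indonesian": 8112.9, "Vietnamese": 7324.0}

# Round to one decimal to avoid floating-point noise in the sum.
total = round(sum(hours.values()), 1)

# Share of the corpus contributed by each language, in percent.
shares = {lang: round(100 * h / total, 1) for lang, h in hours.items()}

print("Total hours:", total)
for lang, pct in shares.items():
    print(f"{lang}: {pct}%")
```

The total matches the Duration field, and Thai accounts for a little under half of the corpus.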

Publisher

Shanghai Jiao Tong University, Peng Cheng Laboratory, The Chinese University of Hong Kong, Tsinghua University, Harbin Institute of Technology, Birch AI, SpeechOcean, AISpeech, Seasalt AI Inc, SpeechColab

Resources

arXiv: https://arxiv.org/abs/2406.11546
Hugging Face: https://huggingface.co/datasets/speechcolab/gigaspeech2
GitHub: https://github.com/SpeechColab/GigaSpeech2