WenetSpeech-Yue
First large-scale open-source Cantonese speech corpus, covering 21,800 hours across 10 domains with multi-dimensional annotations for ASR and TTS
Duration
21,800 hours
Languages
1
Sample Rate
16 kHz
Published
2025-09
Description
1. Total duration of 21,800 hours, containing short and long audio segments with an average segment duration of 11.40 seconds
2. Covers 10 domains: storytelling, entertainment, drama, culture, vlog, commentary, education, podcast, news, and others
3. Multi-dimensional annotations include: ASR transcription, text confidence scores, speaker identity, age, gender, speech quality metrics (SNR, DNSMOS), and word-level timestamps
4. High-quality transcriptions generated through multi-system ASR fusion voting (ROVER) and LLM correction
5. Divided into three subsets by confidence: strong labels (>0.9, 6,771.43 hours), medium labels (0.8-0.9, 10,615.02 hours), weak labels (0.6-0.8, 4,488.13 hours)
6. Filtering by DNSMOS>2.5 and SNR>25dB yields a 12,000-hour high-quality TTS subset
7. Speakers are predominantly middle-aged males (50.6%), sourced from in-the-wild spontaneous speech
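The confidence tiers and the TTS quality filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the corpus's own tooling: the `Segment` record and its field names (`utt_id`, `confidence`, `dnsmos`, `snr_db`) are hypothetical, and the handling of the exact boundary values 0.9, 0.8, and 0.6 is an assumption, since the description gives only the ranges.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    # Hypothetical record layout; these field names are assumptions,
    # not the corpus's actual metadata schema.
    utt_id: str
    confidence: float  # text confidence score
    dnsmos: float      # DNSMOS speech quality score
    snr_db: float      # signal-to-noise ratio in dB


def confidence_tier(confidence: float) -> Optional[str]:
    """Map a confidence score to a label tier.

    Boundary handling is assumed: exactly 0.9 falls in "medium",
    exactly 0.8 in "medium", and exactly 0.6 in "weak".
    """
    if confidence > 0.9:
        return "strong"
    if confidence >= 0.8:
        return "medium"
    if confidence >= 0.6:
        return "weak"
    return None  # below the lowest tier described


def is_tts_quality(seg: Segment) -> bool:
    """Apply the stated TTS filter: DNSMOS > 2.5 and SNR > 25 dB."""
    return seg.dnsmos > 2.5 and seg.snr_db > 25.0


# Made-up example records, for illustration only.
segments = [
    Segment("utt_a", confidence=0.95, dnsmos=3.1, snr_db=32.0),
    Segment("utt_b", confidence=0.85, dnsmos=2.4, snr_db=30.0),
    Segment("utt_c", confidence=0.70, dnsmos=3.0, snr_db=20.0),
]

tiers = {seg.utt_id: confidence_tier(seg.confidence) for seg in segments}
tts_subset = [seg.utt_id for seg in segments if is_tts_quality(seg)]
```

Keeping the tier assignment and the quality filter as separate functions lets an ASR pipeline select by label confidence while a TTS pipeline selects by audio quality, independently of each other.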
Language Details
| Language | Duration |
|---|---|
| Cantonese | 21,800 hours |
Publisher