Description

1. Total duration of 21,800 hours, containing short and long audio segments with an average segment duration of 11.40 seconds

2. Covers 10 domains: storytelling, entertainment, drama, culture, vlog, commentary, education, podcast, news, and others

3. Multi-dimensional annotations include: ASR transcription, text confidence scores, speaker identity, age, gender, speech quality metrics (SNR, DNSMOS), and word-level timestamps

4. High-quality transcriptions generated through multi-system ASR fusion voting (ROVER) and LLM correction

5. Divided into three subsets by confidence: strong labels (>0.9, 6,771.43 hours), medium labels (0.8-0.9, 10,615.02 hours), weak labels (0.6-0.8, 4,488.13 hours)

6. Filtering by DNSMOS>2.5 and SNR>25dB yields a 12,000-hour high-quality TTS subset

7. Speakers are predominantly middle-aged males (50.6%), sourced from in-the-wild spontaneous speech
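
The confidence subsets and the TTS quality filter described above can be sketched roughly as follows. This is a minimal illustration, not the dataset's own tooling: the `Segment` class and its field names (`confidence`, `dnsmos`, `snr_db`) are hypothetical, and the handling of values exactly on a boundary (0.8 or 0.9) is an assumption, since the description only gives the ranges.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    # Hypothetical field names; the released metadata may use different keys.
    confidence: float  # text confidence score of the transcription
    dnsmos: float      # DNSMOS speech quality estimate
    snr_db: float      # signal-to-noise ratio in dB


def confidence_tier(confidence: float) -> Optional[str]:
    """Map a confidence score to a label subset.

    Assumed boundaries: strong is > 0.9, medium is (0.8, 0.9],
    weak is [0.6, 0.8]. Scores below 0.6 fall outside all subsets.
    """
    if confidence > 0.9:
        return "strong"
    if confidence > 0.8:
        return "medium"
    if confidence >= 0.6:
        return "weak"
    return None


def is_tts_quality(seg: Segment) -> bool:
    """Apply the stated TTS filter: DNSMOS > 2.5 and SNR > 25 dB."""
    return seg.dnsmos > 2.5 and seg.snr_db > 25.0


# Toy segments made up for illustration only.
segments = [
    Segment(confidence=0.95, dnsmos=3.1, snr_db=30.0),
    Segment(confidence=0.85, dnsmos=2.4, snr_db=28.0),
    Segment(confidence=0.70, dnsmos=2.9, snr_db=20.0),
    Segment(confidence=0.65, dnsmos=3.0, snr_db=40.0),
]

tiers = [confidence_tier(s.confidence) for s in segments]
tts_subset = [s for s in segments if is_tts_quality(s)]
```

Both thresholds in the TTS filter are strict inequalities, matching the ">" used in the description, so a segment must pass the DNSMOS and SNR checks together to be kept.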

Language Details

Language	Duration
Cantonese	21,800 hours

Publisher

Northwestern Polytechnical University
China Telecom AI Research Institute
Beijing Shell Shell Technology Co. Ltd.
WeNet Open Source Community
Hong Kong University of Science and Technology

Resources

arXiv: https://arxiv.org/abs/2509.03959
GitHub: https://github.com/ASLP-lab/WenetSpeech-Yue