WenetSpeech-Chuan

Name: WenetSpeech-Chuan
Creator: Northwestern Polytechnical University, Beijing Shell Shell Technology Co. Ltd., China Telecom AI Research Institute, Nanjing University, WeNet Open Source Community
Published: 2025-09
License: Apache-2.0

First 10,000-hour-scale open-source Sichuanese speech dataset with rich annotations across multiple domains

Duration

10,013 hours

Languages

Sichuanese

Sample Rate

16 kHz

Published

2025-09
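
The 16 kHz sample rate listed above can be checked on a local clip before it is fed to a model. The following is a minimal sketch using only the Python standard library. It assumes the clip is an uncompressed PCM WAV file, which this page does not confirm; `wav_info` and `matches_expected_rate` are hypothetical helper names, not part of any WenetSpeech-Chuan tooling.

```python
import wave

# Sample rate stated on this page for WenetSpeech-Chuan audio.
EXPECTED_SAMPLE_RATE_HZ = 16_000


def wav_info(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file.

    Assumption: the clip is stored as uncompressed WAV; other formats
    would need a different reader.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
    return rate, frames / rate


def matches_expected_rate(path, expected_hz=EXPECTED_SAMPLE_RATE_HZ):
    """True if the file's sample rate equals the expected rate."""
    rate, _ = wav_info(path)
    return rate == expected_hz
```

A clip that fails this check would typically be resampled before being mixed with the rest of the corpus.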

Description

1. Total duration exceeding 10,013 hours, with 3,714 hours of strong labels (confidence 0.9-1.0) and 6,299 hours of weak labels (confidence 0.6-0.9)

2. Data sourced from 9 domains: short videos (52.83%), entertainment (20.08%), livestreaming (18.35%), documentaries (5.36%), audiobooks (1.14%), interviews (0.89%), news (0.83%), read speech (0.48%), drama (0.05%)

3. Rich annotation dimensions, including transcription text, domain labels, speaker gender, age, emotion, and other paralinguistic information

4. Built with the Chuan-Pipeline processing framework, which incorporates VAD segmentation, single-speaker clustering, LLM-GER transcription error correction, and multimodal punctuation prediction

5. Audio quality concentrated in the WV-MOS 2.5-4.0 range, balancing clean recordings with real-world acoustic conditions

6. At the time of publication (2025-09), the largest open-source Sichuanese dialect speech dataset
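
The label tiers and domain shares in the description can be turned into a small worked example. This is a minimal sketch, not the dataset's own tooling: `label_tier` and `approx_domain_hours` are hypothetical names, and the handling of the 0.9 boundary (counted as strong here) is an assumption, since the description lists 0.9 in both ranges.

```python
from typing import Dict, Optional

# Tier boundaries and totals taken from the description above.
STRONG_MIN = 0.9
WEAK_MIN = 0.6
TOTAL_HOURS = 10013

# Domain shares in percent, as listed in the description.
DOMAIN_SHARE_PCT = {
    "short videos": 52.83,
    "entertainment": 20.08,
    "livestreaming": 18.35,
    "documentaries": 5.36,
    "audiobooks": 1.14,
    "interviews": 0.89,
    "news": 0.83,
    "read speech": 0.48,
    "drama": 0.05,
}


def label_tier(confidence: float) -> Optional[str]:
    """Map a transcription confidence score to 'strong', 'weak', or None.

    Assumptions: exactly 0.9 counts as strong, and scores below 0.6
    fall outside both tiers.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= STRONG_MIN:
        return "strong"
    if confidence >= WEAK_MIN:
        return "weak"
    return None


def approx_domain_hours(total_hours: float = TOTAL_HOURS) -> Dict[str, float]:
    """Estimate hours per domain from the published percentage shares."""
    return {
        domain: round(total_hours * pct / 100, 1)
        for domain, pct in DOMAIN_SHARE_PCT.items()
    }


if __name__ == "__main__":
    for score in (0.95, 0.9, 0.72, 0.4):
        print(score, label_tier(score))
    for domain, hours in approx_domain_hours().items():
        print(f"{domain}: ~{hours} h")
```

The per-domain hours are estimates only: the published shares are rounded and add up to 100.01%, so the estimated hours will not sum exactly to the total.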

Language Details

Language	Duration
Sichuanese	10,013 hours

Publisher

Northwestern Polytechnical University, Beijing Shell Shell Technology Co. Ltd., China Telecom AI Research Institute, Nanjing University, WeNet Open Source Community

License & Commercial Use

License

Apache-2.0

Commercial Use

Commercial use allowed

Resources

Paper: https://arxiv.org/abs/2509.18004
Hugging Face: https://huggingface.co/datasets/ASLP-lab/WSC-Train
GitHub: https://github.com/ASLP-lab/WenetSpeech-Chuan
Sample / Demo: https://aslp-lab.github.io/WenetSpeech-Chuan/