WenetSpeech-Wu

Name: WenetSpeech-Wu
Creator: Northwestern Polytechnical University, Beijing Shell Shell Technology Co. Ltd., WeNet Open Source Community, Moonstep AI, Xi'an Jiaotong-Liverpool University, YK Pao School
Published: 2026-01
License: Apache-2.0

The first large-scale, multi-dimensionally annotated open-source Wu Chinese speech dataset, covering about 8,000 hours across 8 Wu sub-dialects.

Duration

~8,000 hours

Languages

Wu Chinese

Sample Rate

16 kHz

Published

2026-01

Description

1. Contains approximately 8,000 hours of Wu Chinese speech data: 3.86 million speech segments with an average duration of 7.45 seconds

2. Covers 8 Wu sub-dialects: Shanghainese, Suzhounese, Shaoxingnese, Ningbonese, Hangzhounese, Jiaxingnese, Taizhouese, and Wenzhounese

3. Spans 11 domains: news, culture, vlog, entertainment, education, podcast, commentary, interviews, radio drama, music programs, and audiobooks

4. Multi-dimensional annotations: transcription text (with confidence scores), Wu-to-Mandarin translation, domain and sub-dialect labels, speaker attributes (gender, age), emotion labels, and audio quality metrics

5. Quality filtering using DNSMOS and SNR, with high-quality transcriptions generated through multi-ASR system ROVER fusion

6. Tiered data quality strategy designed for different tasks, supporting ASR, TTS, speech translation, emotion recognition, and instructed TTS

7. Sourced from in-the-wild Wu Chinese speech; for approximately 37% of recordings the specific sub-dialect could not be identified
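
The tiered quality strategy described in the list above can be illustrated with a minimal Python sketch. The field names (`dnsmos`, `snr_db`, `confidence`), the tier names, and every threshold below are assumptions made purely for illustration; the dataset's actual column names and cut-offs are not stated here and should be taken from the paper or repository.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical per-segment record; field names are illustrative only.
    utt_id: str
    dnsmos: float      # perceptual quality estimate (higher is better)
    snr_db: float      # signal-to-noise ratio in dB
    confidence: float  # transcription confidence in [0, 1]


def assign_tier(seg: Segment) -> str:
    """Map a segment to a quality tier. All thresholds are assumed values."""
    # Strictest tier: clean audio with a very reliable transcript.
    if seg.dnsmos >= 3.5 and seg.snr_db >= 25.0 and seg.confidence >= 0.9:
        return "high"
    # Middle tier: acceptable audio with a reliable transcript.
    if seg.dnsmos >= 2.5 and seg.confidence >= 0.8:
        return "medium"
    # Everything else: only for tasks that tolerate noisy audio or labels.
    return "low"


segments = [
    Segment("utt_a", dnsmos=3.8, snr_db=30.0, confidence=0.95),
    Segment("utt_b", dnsmos=3.0, snr_db=15.0, confidence=0.85),
    Segment("utt_c", dnsmos=2.0, snr_db=8.0, confidence=0.60),
]
tiers = {seg.utt_id: assign_tier(seg) for seg in segments}
print(tiers)
```

Under these assumed tiers, a TTS pipeline might keep only the "high" segments, while ASR training could plausibly use "high" and "medium" together.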

Language Details

Language	Duration
Wu Chinese	~8,000 hours
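
As a quick consistency check, the total duration can be reproduced from the segment count and average segment length quoted in the description. The short Python sketch below uses only those two figures:

```python
# Figures quoted in the description.
num_segments = 3_860_000   # 3.86 million speech segments
avg_seconds = 7.45         # average segment duration in seconds

# Total duration = segment count x average duration, converted to hours.
total_hours = num_segments * avg_seconds / 3600
print(f"{total_hours:.0f} hours")
```

The result is within 1% of the rounded 8,000-hour headline figure.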

Publisher

Northwestern Polytechnical University, Beijing Shell Shell Technology Co. Ltd., WeNet Open Source Community, Moonstep AI, Xi'an Jiaotong-Liverpool University, YK Pao School

License & Commercial Use

License

Apache-2.0

Commercial Use

Commercial use allowed

Resources

GitHub: https://github.com/ASLP-lab/WenetSpeech-Wu-Repo
Paper: https://arxiv.org/abs/2601.11027
Hugging Face: https://huggingface.co/datasets/ASLP-lab/WenetSpeech-Wu
Sample / Demo: https://hujingbin1.github.io/WenetSpeechWu-Demo-Page-Public/