WenetSpeech
A multi-domain open-source Mandarin ASR corpus with 10,000+ hours of strongly labeled speech, presented at release as the largest open-source Chinese speech dataset
Duration
22,400 hours
Languages
1
Sample Rate
16 kHz
Published
2021-10
Description
1. Collected from YouTube and podcasts, covering multiple domains and speaking styles
2. 10,000+ hours of strongly labeled data (confidence >= 0.95), 2,400+ hours of weakly labeled data, and ~10,000 hours of unlabeled data, totaling 22,400+ hours
3. Labeled automatically using OCR and ASR technologies, with quality filtering via end-to-end label error detection
4. Provides S, M, and L training subsets for building ASR systems at different data scales
5. Strongly labeled data is categorized into 10 domain groups
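
The labeling tiers and the hour totals listed above can be sketched in a few lines of Python. This is only an illustration: the `Segment` record and its field names are hypothetical and do not reflect the corpus's actual metadata schema, and the 0.80 confidence value is a placeholder. Only the 0.95 threshold and the nominal hour figures come from the description.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Threshold for "strongly labeled" data, as stated in the description.
STRONG_CONFIDENCE = 0.95


@dataclass
class Segment:
    """Hypothetical record layout, not the corpus's real metadata format."""
    hours: float
    confidence: Optional[float]  # None means no transcript is available


def label_tier(seg: Segment) -> str:
    """Assign a segment to a labeling tier based on its confidence score."""
    if seg.confidence is None:
        return "unlabeled"
    return "strong" if seg.confidence >= STRONG_CONFIDENCE else "weak"


def hours_by_tier(segments: Iterable[Segment]) -> dict:
    """Sum audio duration per labeling tier."""
    totals = {"strong": 0.0, "weak": 0.0, "unlabeled": 0.0}
    for seg in segments:
        totals[label_tier(seg)] += seg.hours
    return totals


# Nominal hour figures from the description; 0.80 is an illustrative
# placeholder for a weakly labeled segment.
nominal = [
    Segment(hours=10_000, confidence=0.95),
    Segment(hours=2_400, confidence=0.80),
    Segment(hours=10_000, confidence=None),
]
totals = hours_by_tier(nominal)
total_hours = sum(totals.values())
print(totals, total_hours)
```

With the nominal figures, the three tiers add up to the 22,400 hours reported in the Duration field.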
Language Details
| Language | Duration |
|---|---|
| Mandarin Chinese | 22,400 hours |
Publisher