Emilia
First large-scale, multilingual, diverse in-the-wild speech generation dataset with over 101k hours of natural speech in 6 languages
Duration
101,654 hours
Languages
6
Sample Rate
24 kHz
Published
2024-07
Description
1. Approximately 101,654 hours of multilingual speech data covering English, Chinese, German, French, Japanese, and Korean
2. Sourced from online video platforms and podcasts, spanning talk shows, interviews, debates, sports commentary, audiobooks, and other content categories
3. Predominantly spontaneous speech covering a wide range of speaking styles, with natural features such as breathing, pauses, repetitions, tempo variations, and emotional changes
4. Accompanied by the open-source preprocessing toolkit Emilia-Pipe, which supports standardization, vocal separation, speaker diarization, VAD segmentation, ASR transcription, and quality filtering
5. Audio uniformly resampled to 24 kHz, mono, 16-bit, with a target loudness of -20 dBFS
6. Filtered by DNSMOS P.835 OVRL scores, retaining only segments scoring above 3.0, with a final average score of 3.26
7. Each speech segment is 3 to 30 seconds long and comes with an ASR text transcription
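The loudness target and the segment-level criteria above can be illustrated with a minimal sketch. This is not the Emilia-Pipe implementation: the function names are hypothetical, dBFS is assumed to mean RMS level relative to a full scale of 1.0, and the 3 to 30 second bounds are assumed to be inclusive.

```python
import math

TARGET_DBFS = -20.0        # target loudness from the description
MIN_SEC, MAX_SEC = 3.0, 30.0
DNSMOS_THRESHOLD = 3.0     # segments must score strictly above this


def rms_dbfs(samples):
    """RMS level in dBFS for float samples in [-1.0, 1.0] (assumed definition)."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")


def normalize_loudness(samples, target_dbfs=TARGET_DBFS):
    """Apply a constant gain so the RMS level reaches target_dbfs, clipping to full scale."""
    level = rms_dbfs(samples)
    if level == float("-inf"):
        return list(samples)  # silence: nothing to scale
    gain = 10.0 ** ((target_dbfs - level) / 20.0)
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


def keep_segment(duration_s, dnsmos_ovrl):
    """Keep a segment only if its length and DNSMOS OVRL score meet the stated criteria."""
    return MIN_SEC <= duration_s <= MAX_SEC and dnsmos_ovrl > DNSMOS_THRESHOLD


# Hypothetical usage: a 0.5-amplitude sine tone brought to the target level.
tone = [0.5 * math.sin(2 * math.pi * k / 100) for k in range(1000)]
normalized = normalize_loudness(tone)
```

Resampling to 24 kHz and conversion to mono 16-bit are left out here because they depend on the audio library in use.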
Language Details
| Language | Duration |
|---|---|
| Chinese | 49900 hours |
| English | 46800 hours |
| French | 1800 hours |
| Japanese | 1700 hours |
| German | 1600 hours |
| Korean | 200 hours |
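As a quick arithmetic check, each language's share of the corpus can be derived from the table. The sketch below copies the durations as listed; the variable names are illustrative. The per-language figures are rounded, so they sum to 102,000 hours, slightly above the 101,654 hours reported as the total duration.

```python
# Per-language durations in hours, copied from the table above.
hours = {
    "Chinese": 49900,
    "English": 46800,
    "French": 1800,
    "Japanese": 1700,
    "German": 1600,
    "Korean": 200,
}

total = sum(hours.values())  # sum of the rounded table values
shares = {lang: 100.0 * h / total for lang, h in hours.items()}

for lang, pct in shares.items():
    print(f"{lang}: {pct:.2f}%")
```

Chinese and English together account for roughly 95% of the listed hours, so the four remaining languages are comparatively low-resource within the dataset.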
Publisher