Emilia

Name: Emilia
Creator: The Chinese University of Hong Kong (Shenzhen), Institute of Acoustics (Chinese Academy of Sciences), University of Chinese Academy of Sciences, Shanghai AI Laboratory
Published: 2024-07
License: CC BY 4.0

The first large-scale, multilingual, and diverse in-the-wild speech generation dataset, with over 101k hours of natural speech in six languages.

Duration

101,654 hours

Languages

6 (English, Chinese, German, French, Japanese, Korean)

Sample Rate

24 kHz

Published

2024-07

Description

1. Approximately 101,654 hours of multilingual speech data covering English, Chinese, German, French, Japanese, and Korean

2. Sourced from online video platforms and podcasts, spanning talk shows, interviews, debates, sports commentary, audiobooks, and other content categories

3. Predominantly spontaneous speech covering a wide range of speaking styles, including breathing, pauses, repetitions, tempo variations, and emotional changes

4. Accompanied by the open-source preprocessing toolkit Emilia-Pipe, supporting standardization, vocal separation, speaker diarization, VAD segmentation, ASR transcription, and quality filtering

5. Audio uniformly resampled to 24 kHz, mono, 16-bit, with target loudness of -20 dBFS

6. Filtered by DNSMOS P.835 OVRL scores, retaining only segments scoring above 3.0, with a final average score of 3.26

7. Each speech segment is 3 to 30 seconds in duration, with accompanying ASR text transcription
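
The loudness target and the retention criteria in items 5 to 7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Emilia-Pipe's actual code: it assumes the -20 dBFS target is an RMS level, that the 3 to 30 second bounds are inclusive, and that each segment carries a duration and a DNSMOS OVRL score. The `Segment` class and all function names are hypothetical.

```python
import math
from dataclasses import dataclass

TARGET_DBFS = -20.0      # target loudness from the description
MIN_SECONDS = 3.0        # shortest segment kept (assumed inclusive)
MAX_SECONDS = 30.0       # longest segment kept (assumed inclusive)
MIN_DNSMOS_OVRL = 3.0    # segments must score strictly above this


def rms_dbfs(samples):
    """RMS level of float samples in [-1, 1], in dB relative to full scale."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")


def normalize_loudness(samples, target_dbfs=TARGET_DBFS):
    """Scale samples so their RMS level matches target_dbfs."""
    level = rms_dbfs(samples)
    if level == float("-inf"):
        return list(samples)  # silence: nothing to scale
    gain = 10.0 ** ((target_dbfs - level) / 20.0)
    # Clip in case the gain pushes peaks past full scale.
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


@dataclass
class Segment:
    # Hypothetical metadata record; field names are illustrative only.
    seconds: float
    dnsmos_ovrl: float


def keep(segment):
    """True if the segment meets the duration and quality criteria."""
    return (MIN_SECONDS <= segment.seconds <= MAX_SECONDS
            and segment.dnsmos_ovrl > MIN_DNSMOS_OVRL)
```

A quiet one-second tone passed through `normalize_loudness` comes out at an RMS level of -20 dBFS, and `keep(Segment(seconds=10.0, dnsmos_ovrl=3.0))` is false because the score must exceed 3.0 rather than equal it.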

Language Details

Language	Duration
Chinese	49900 hours
English	46800 hours
French	1400 hours
Japanese	1700 hours
German	1600 hours
Korean	200 hours

Publisher

The Chinese University of Hong Kong (Shenzhen), Institute of Acoustics (Chinese Academy of Sciences), University of Chinese Academy of Sciences, Shanghai AI Laboratory

License & Commercial Use

License

CC BY 4.0

Commercial Use

Commercial use allowed

Resources

Paper: https://arxiv.org/abs/2407.05361
Hugging Face: https://huggingface.co/datasets/amphion/Emilia-Dataset
Sample / Demo: https://emilia-dataset.github.io/Emilia-Demo-Page/
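
For readers who want to inspect a few records without downloading the full corpus, the sketch below streams from the Hugging Face repository listed above using the third-party `datasets` library. It rests on several assumptions: network access, an installed `datasets` package, and possibly an authenticated Hugging Face account if the repository requires accepting access terms. The helper names `repo_id_from_url` and `peek` are hypothetical, and the record fields are not specified on this page, so the sketch only prints whatever keys it finds.

```python
from urllib.parse import urlparse

HF_URL = "https://huggingface.co/datasets/amphion/Emilia-Dataset"


def repo_id_from_url(url):
    """Extract '<namespace>/<name>' from a Hugging Face dataset URL."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return "/".join(parts[1:3])


def peek(n=3):
    """Print the field names of the first n records (needs network access)."""
    from datasets import load_dataset  # third-party: pip install datasets

    # streaming=True iterates lazily instead of downloading everything first.
    splits = load_dataset(repo_id_from_url(HF_URL), streaming=True)
    first_split = next(iter(splits.values()))
    for record in first_split.take(n):
        print(sorted(record.keys()))
```

Calling `peek()` prints the sorted field names of the first three records from whichever split the repository exposes first. Streaming is used here because the corpus is far too large to download just for a quick look.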