VoxPopuli
A large-scale multilingual speech corpus from European Parliament recordings, with 400K hours of unlabeled speech covering 23 languages
Duration
400,000 hours
Languages
23
Sample Rate
16 kHz
Published
2021-01
Description
1. Collected from publicly available European Parliament recordings
2. 400K hours of unlabeled speech data covering 23 languages, 9,000-18,000 hours per language
3. 1.8K hours of transcribed speech covering 16 languages
4. 17.3K hours of interpretation-aligned audio covering 15 target languages
5. Released under the CC0 license
6. At the time of release (2021), the largest open dataset for unsupervised representation learning and semi-supervised learning
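To give a sense of scale, the 16 kHz sample rate and the 400K-hour unlabeled total can be turned into a rough storage estimate. The sketch below assumes uncompressed 16-bit mono PCM, which is not stated in this entry; the helper name `raw_audio_bytes` is hypothetical, and the actual download size depends on the codec the data is distributed in.

```python
# Back-of-envelope size of the unlabeled portion if stored as raw PCM.
# Assumptions (not from the dataset card): 16-bit samples, mono, no compression.
SAMPLE_RATE_HZ = 16_000   # from the card: 16 kHz
UNLABELED_HOURS = 400_000  # from the card: 400K hours
BYTES_PER_SAMPLE = 2      # assumption: 16-bit PCM
CHANNELS = 1              # assumption: mono


def raw_audio_bytes(hours, sample_rate_hz=SAMPLE_RATE_HZ,
                    bytes_per_sample=BYTES_PER_SAMPLE, channels=CHANNELS):
    """Return the uncompressed size in bytes of `hours` of audio."""
    seconds = hours * 3600
    return seconds * sample_rate_hz * bytes_per_sample * channels


total_samples = UNLABELED_HOURS * 3600 * SAMPLE_RATE_HZ
total_bytes = raw_audio_bytes(UNLABELED_HOURS)

print(f"{total_samples:.3e} samples")      # 2.304e+13 samples
print(f"{total_bytes / 1e12:.2f} TB raw")  # 46.08 TB raw
```

Under these assumptions the unlabeled audio alone is on the order of tens of terabytes uncompressed, which is worth keeping in mind when planning storage or choosing a per-language subset.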
Language Details
| Language | Duration |
|---|---|
| English | N/A |
| German | N/A |
| French | N/A |
| Spanish | N/A |
| Polish | N/A |
Publisher