Omnilingual ASR Corpus
Meta's 3,350-hour multilingual ASR corpus covering 348 low-resource languages, CC BY 4.0 license
Duration
3,350 hours
Languages
348
Sample Rate
16 kHz
Published
2025-11
Description
1. Contains 3,350 hours of transcribed speech data covering 348 low-resource languages
2. Approximately 10 hours of transcribed speech per language on average, collected through compensated community partnerships
3. Primarily spontaneous speech recordings with accompanying transcriptions
4. Released as part of the Omnilingual ASR project, which uses 120,710 total training hours across 1,690 languages
5. Core data contribution to the first ASR system covering 1,600+ languages
6. Released under the CC BY 4.0 license
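The headline figures above can be sanity-checked with a little arithmetic. The sketch below is only illustrative: the function names are hypothetical, and the share calculation assumes that all 3,350 corpus hours are counted within the 120,710 training hours, which the description implies but does not state outright.

```python
# Figures quoted in the description; nothing here is read from the corpus itself.
CORPUS_HOURS = 3_350
CORPUS_LANGUAGES = 348
TOTAL_TRAINING_HOURS = 120_710


def average_hours_per_language(hours: float, languages: int) -> float:
    """Mean transcribed hours per language (hypothetical helper)."""
    return hours / languages


def share_of_training_hours(hours: float, total_hours: float) -> float:
    """Fraction of the total training hours contributed by the corpus.

    Assumes the corpus hours are fully included in the total.
    """
    return hours / total_hours


avg_hours = average_hours_per_language(CORPUS_HOURS, CORPUS_LANGUAGES)
share = share_of_training_hours(CORPUS_HOURS, TOTAL_TRAINING_HOURS)

print(f"Average per language: {avg_hours:.1f} hours")
print(f"Share of training hours: {share:.1%}")
```

The average comes out slightly under the "approximately 10 hours" quoted above, and under the stated assumption the corpus accounts for a small single-digit percentage of the project's total training hours.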
Language Details
| Language | Duration |
|---|---|
| Multilingual (348 low-resource languages) | 3,350 hours |
Publisher