Libriheavy
A 50,000-hour English read-speech ASR corpus built from LibriVox audiobooks, with punctuation, casing, and preceding-text context; presented at release as the largest freely available supervised English speech dataset
Duration
50,000 hours
Languages
1
Sample Rate
16 kHz
Published
2023-09
Description
1. Sourced from the LibriVox open-source audiobook project; built by automatically aligning and segmenting the 60,000 hours of unlabeled audio in Libri-Light
2. Includes three training subsets: small (500 hours), medium (5,000 hours), and large (50,000 hours), plus dev, test-clean, and test-other evaluation subsets
3. Unlike most ASR datasets, provides fully formatted transcriptions with punctuation and casing
4. Each audio segment comes with its preceding text context, supporting contextualized speech recognition
5. Metadata is stored in the Lhotse cuts JSON lines format; each line contains a transcription and the corresponding audio source information
6. No speaker or book overlap between the training and evaluation sets, keeping evaluation independent of training
7. The authors also open-source a general-purpose audio alignment pipeline that can be applied to other alignment tasks
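Because the metadata is a JSON lines file, a manifest can be inspected with the Python standard library alone, without installing Lhotse. The snippet below is a minimal sketch, not the official loader: the key names (`supervisions`, `recording`, `sources`, `text`) are assumptions modeled on Lhotse's cut layout, and the sample record, its ID, and its file path are invented for illustration. Check them against a real manifest line before relying on them.

```python
import gzip
import io
import json
from typing import Iterator, TextIO


def open_manifest(path: str) -> TextIO:
    """Open a JSON lines manifest as text, transparently handling .gz files."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def iter_cuts(stream: TextIO) -> Iterator[dict]:
    """Yield one cut (a dict) per non-empty line of a JSON lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(cut: dict) -> dict:
    """Extract the transcription and audio source of a single cut.

    The key names used here are assumptions about the manifest layout.
    """
    supervision = cut["supervisions"][0]
    source = cut["recording"]["sources"][0]["source"]
    return {
        "id": cut["id"],
        "start": cut["start"],
        "duration": cut["duration"],
        "text": supervision["text"],
        "audio": source,
    }


# Hypothetical record, invented for illustration; not copied from the corpus.
SAMPLE_LINE = json.dumps(
    {
        "id": "example-cut-0001",
        "start": 12.5,
        "duration": 8.0,
        "supervisions": [{"text": "Well, said the captain, what now?"}],
        "recording": {
            "sampling_rate": 16000,
            "sources": [{"type": "file", "source": "audio/example_chapter.flac"}],
        },
    }
)

if __name__ == "__main__":
    # Parse an in-memory stream so the sketch runs without any downloaded files.
    for cut in iter_cuts(io.StringIO(SAMPLE_LINE + "\n")):
        print(summarize(cut))
```

For a real file, replace the in-memory stream with `open_manifest(path)`, where `path` points to a downloaded manifest. Streaming line by line keeps memory use flat, which matters for the large subset.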
Language Details
| Language | Duration |
|---|---|
| English | 50,000 hours |
Publisher