Name: Libriheavy
Creator: Xiaomi
Published: 2023-09

Description

1. Sourced from the LibriVox open-source audiobook project; built by automatically aligning and segmenting the 60,000 hours of unlabeled audio in Libri-Light.

2. Includes three training subsets, small (500 hours), medium (5,000 hours), and large (50,000 hours), plus dev, test-clean, and test-other evaluation subsets.

3. Unlike most other ASR datasets, provides fully formatted transcriptions that retain punctuation and casing.

4. Each audio segment comes with its preceding text context, which supports contextualized speech recognition.

5. Metadata is stored in the Lhotse cuts JSON Lines format; each line contains a transcription and the corresponding audio source information.

6. Training and evaluation sets share no speakers or books, which keeps evaluation independent of training.

7. The general-purpose audio alignment pipeline is also open-sourced and can be applied to other alignment tasks.
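
The JSON Lines metadata described in item 5 can be read with the Python standard library alone. The sketch below is illustrative only: the field names (`supervisions`, `recording`, `sources`, `text`, `start`, `duration`) follow the general layout of a Lhotse cut and are assumptions here, not a verified description of the Libriheavy manifests, and the sample record is made up.

```python
import json

# Hypothetical sample record. The field names are assumed from the general
# Lhotse cut layout; real Libriheavy manifests may contain other fields.
SAMPLE_LINE = json.dumps({
    "id": "cut-0001",
    "start": 12.5,
    "duration": 8.0,
    "channel": 0,
    "supervisions": [{
        "id": "cut-0001",
        "recording_id": "rec-0001",
        "start": 0.0,
        "duration": 8.0,
        "text": "Hello, world!",
        "speaker": "spk-42",
    }],
    "recording": {
        "id": "rec-0001",
        "sources": [{"type": "file", "channels": [0],
                     "source": "audio/rec-0001.flac"}],
        "sampling_rate": 16000,
    },
})


def iter_cuts(lines):
    """Yield one dict per non-empty line of a JSON Lines manifest."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(cut):
    """Pull out the transcription and audio source of a single cut."""
    supervision = cut["supervisions"][0]
    source = cut["recording"]["sources"][0]
    return {
        "id": cut["id"],
        "text": supervision["text"],
        "audio": source["source"],
        "start": cut["start"],
        "end": cut["start"] + cut["duration"],
    }


if __name__ == "__main__":
    for cut in iter_cuts([SAMPLE_LINE]):
        print(summarize(cut))
```

If a manifest is gzip-compressed, the file object returned by `gzip.open(path, "rt")` can be passed to `iter_cuts` directly, since it yields one line at a time.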

Language Details

Language	Duration
English	50,000 hours

Publisher

Xiaomi

Resources

arXiv: https://arxiv.org/abs/2309.08105
GitHub: https://github.com/k2-fsa/libriheavy