Chinese-LiPS
First multimodal Chinese audio-visual speech recognition dataset combining lip reading and presentation slides, with 100 hours of manually transcribed speech
Duration
100 hours
Languages
1
Sample Rate
16 kHz
Published
2025-04
Description
1. Contains approximately 100 hours of speech data: 36,208 segments from 207 speakers
2. Visual modality includes both lip-reading video and the speaker's presentation slides
3. Presentation slides were designed by domain experts to ensure content quality and visual richness
4. Speech was recorded by professionals from various fields in China in quiet, natural environments; all speakers use Mandarin
5. Covers 9 topic domains: esports/gaming, automotive industry, travel exploration, sports, culture/history, science/technology, film/TV series, health/wellness, and others
6. Near-balanced speaker gender ratio of 1:1.13 (male:female)
7. Speaker ages fall primarily between 20 and 30 years; segments average 10 seconds in duration, with a maximum of 30 seconds
8. Split into 80% training, 15% test, and 5% validation sets, with no speaker overlap across subsets
9. All components were carefully edited and manually aligned to ensure precision
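The speaker-disjoint 80/15/5 split described above can be illustrated with a minimal sketch. This is not the publisher's actual procedure, which the description does not specify. The function name `split_by_speaker`, the `(segment_id, speaker_id)` input format, and the choice to balance by segment count rather than by duration are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_speaker(segments, ratios=(0.80, 0.15, 0.05), seed=0):
    """Assign whole speakers to train/test/validation (hypothetical helper).

    segments: iterable of (segment_id, speaker_id) pairs (assumed format).
    Returns (splits, owner), where splits maps a split name to its segment
    ids and owner maps each speaker id to the split it was assigned to.
    """
    # Group segments by speaker so that a speaker is never divided.
    by_speaker = defaultdict(list)
    for seg_id, spk in segments:
        by_speaker[spk].append(seg_id)

    # Shuffle speakers deterministically for a reproducible split.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    total = sum(len(v) for v in by_speaker.values())
    names = ("train", "test", "validation")
    targets = {n: r * total for n, r in zip(names, ratios)}
    splits = {n: [] for n in names}
    owner = {}

    for spk in speakers:
        # Greedy choice: give the speaker to the split that is currently
        # furthest below its target segment count.
        name = max(names, key=lambda n: targets[n] - len(splits[n]))
        splits[name].extend(by_speaker[spk])
        owner[spk] = name
    return splits, owner


if __name__ == "__main__":
    # Synthetic stand-in data: 207 speakers with 10 to 14 segments each.
    demo = [(f"seg{s}_{i}", f"spk{s}") for s in range(207) for i in range(10 + s % 5)]
    splits, owner = split_by_speaker(demo)
    for name, segs in splits.items():
        print(name, len(segs), "segments")
```

Because whole speakers are assigned at once, the resulting proportions only approximate 80/15/5. The guarantee that matters is that no speaker appears in more than one subset.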
Language Details
| Language | Duration |
|---|---|
| Mandarin Chinese | 100 hours |
Publisher