CS-Dialogue
Largest publicly available spontaneous Mandarin-English code-switching dialogue dataset, 104 hours with 200 speakers
Duration
104 hours
Languages
2
Sample Rate
16 kHz
Published
2025-02
Description
1. Total duration of 104.02 hours, containing 38,917 speech segments with an average duration of 9.62 seconds
2. 200 speakers participating in 100 natural conversation recordings, aged 18-53, with relatively balanced gender distribution
3. Speakers are Chinese citizens with fluent English proficiency (overseas experience or IELTS 6+ / TEM-4), each compensated 300 RMB
4. Each conversation contains pure Mandarin, code-switching, and pure English segments, covering 7 topic domains
5. Recorded on smartphones in quiet environments, 16 kHz sample rate, 16-bit precision, mono PCM WAV format
6. Complete transcription annotations including unintelligible speech, filled pauses, speaker noise, and other non-lexical event markers
7. Split into training (140 speakers, 68.97 hours), validation (30 speakers, 18.30 hours), and test (30 speakers, 16.74 hours) sets with no speaker overlap
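The audio format stated above (16 kHz, 16-bit, mono PCM WAV) can be checked on a downloaded file before training. The following is a minimal sketch using only Python's standard `wave` module; the function names `check_wav_format` and `duration_seconds` are hypothetical helpers written for illustration and are not part of any official CS-Dialogue tooling.

```python
import wave

# Expected format, taken from the dataset description.
EXPECTED_RATE_HZ = 16_000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16-bit precision
EXPECTED_CHANNELS = 1            # mono


def check_wav_format(path):
    """Return a list of mismatches between a PCM WAV file and the stated format.

    An empty list means the file matches. (Hypothetical helper.)
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(f"sample width is {wav.getsampwidth()} bytes")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"channel count is {wav.getnchannels()}")
    return problems


def duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds. (Hypothetical helper.)"""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

Calling `check_wav_format` on each segment and summing `duration_seconds` per split gives a quick way to compare a local copy against the durations listed above; small differences from the published totals may come from rounding.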
Language Details
| Language | Duration |
|---|---|
| Mandarin Chinese | Not reported |
| English | Not reported |
Publisher