CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset
Hamed Injy
Shimizu Shuichiro
Lodagala Vasista Sai
Chen William
Iakovenko Olga
Talafha Bashar
Hussein Amir
Polok Alexander, Ing., DCGM (FIT)
Chang Kalvin
Klement Dominik, Ing., DCGM (FIT)
Althubaiti Sara
Peng Puyuan
Wiesner Matthew
Solorio Thamar
Ali Ahmed
Khudanpur Sanjeev
Watanabe Shinji
We present CS-FLEURS, a new dataset for developing and evaluating code-switched speech recognition and translation systems beyond high-resourced languages. CS-FLEURS consists of four test sets that together cover 113 unique code-switched language pairs across 52 languages: 1) a 14 X-English language pair set with real voices reading synthetically generated code-switched sentences, 2) a 16 X-English language pair set with generative text-to-speech, 3) a 60 {Arabic, Mandarin, Hindi, Spanish}-X language pair set with generative text-to-speech, and 4) a 45 X-English lower-resourced language pair test set with concatenative text-to-speech. Besides the four test sets, CS-FLEURS also provides a training set with 128 hours of generative text-to-speech data across 16 X-English language pairs. Our hope is that CS-FLEURS helps to broaden the scope of future code-switched speech research.
code-switching, code-switched speech recognition, multilingual speech recognition and translation
@inproceedings{BUT199996,
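author="{} and Injy {Hamed} and Shuichiro {Shimizu} and Vasista Sai {Lodagala} and William {Chen} and Olga {Iakovenko} and Bashar {Talafha} and Amir {Hussein} and Alexander {Polok} and Kalvin {Chang} and Dominik {Klement} and Sara {Althubaiti} and Puyuan {Peng} and Matthew {Wiesner} and Thamar {Solorio} and Ahmed {Ali} and Sanjeev {Khudanpur} and Shinji {Watanabe}",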
title="CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset",
booktitle="Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
year="2025",
journal="Interspeech",
pages="743--747",
publisher="ISCA",
address="Rotterdam, Netherlands",
doi="10.21437/interspeech.2025-2247",
url="https://www.isca-archive.org/interspeech_2025/yan25c_interspeech.pdf"
}