REAL-T: Real Conversational Mixtures for Target Speaker Extraction
Current target speaker extraction (TSE) systems achieve remarkable performance on synthetic datasets like LibriMix and WSJMix. However, their effectiveness in real conversational scenarios, where the cocktail party problem is most prevalent, remains largely unexplored. In this paper, we conduct a comprehensive analysis of several speaker diarization datasets and introduce REAL-T, the first conversation-centric dataset specifically designed for TSE in real-world conditions. Our evaluations reveal significant performance degradation of existing TSE models on this dataset, highlighting the unaddressed complexity of real-world speech extraction. To facilitate controlled benchmarking, we define two subsets: BASE and PRIMARY, ensuring more manageable yet challenging evaluation settings.
Keywords: conversational | dataset | REAL-T | real-world | target speaker extraction
@inproceedings{BUT199411,
author="{} and {} and Jiangyu {Han} and {} and {} and {}",
title="REAL-T: Real Conversational Mixtures for Target Speaker Extraction",
booktitle="Proceedings of the Annual Conference of the International Speech Communication Association Interspeech",
year="2025",
journal="Interspeech",
pages="1923--1927",
publisher="International Speech Communication Association",
address="Rotterdam, The Netherlands",
doi="10.21437/Interspeech.2025-2662",
url="https://www.isca-archive.org/interspeech_2025/li25da_interspeech.pdf"
}