Result Details
DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition
Klement Dominik, Ing., FIT (FIT), DCGM (FIT)
Kocour Martin, Ing., DCGM (FIT)
DiCoW (Diarization-Conditioned Whisper) is a Target Speaker Automatic Speech Recognition (TS-ASR) system that integrates speaker diarization cues into OpenAI's Whisper model. By conditioning on speaker identity, DiCoW enables highly accurate transcription of a target speaker's speech in complex, multi-speaker environments.
At the time of publication, DiCoW achieved state-of-the-art performance on the Libri2Mix and AMI benchmarks. The system received the Jury Award in the CHiME-8 Task 2 (NOTSOFAR) challenge and the Best Reproducibility Award at the Challenge and Workshop on Multilingual Conversational Speech Language Model (MLC-SLM).
Diarization, Conditioned Whisper, Target Speaker, Automatic Speech Recognition