Publication Details

DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition

POLOK, A.; KLEMENT, D.; KOCOUR, M.; HAN, J.; LANDINI, F.; YUSUF, B.; WIESNER, M.; KHUDANPUR, S.; ČERNOCKÝ, J.; BURGET, L. DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition. COMPUTER SPEECH AND LANGUAGE, 2025, p. 1-39. ISSN: 0885-2308.
Czech title
Diarizací podmíněný model Whisper pro automatické rozpoznávání řeči cílového mluvčího
Type
journal article
Language
English
Keywords

Diarization-Conditioned Whisper, Target-Speaker ASR, Speaker Diarization, Long-Form ASR, Whisper Adaptation

Abstract

Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propose Diarization-Conditioned Whisper (DiCoW), a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information. DiCoW extends the pre-trained Whisper model by integrating diarization labels directly, eliminating reliance on speaker embeddings and reducing the need for extensive speaker-specific training data. Our method introduces frame-level diarization-dependent transformations (FDDT) and query-key biasing (QKb) techniques to refine the model's focus on target speakers while effectively handling overlapping speech. By leveraging diarization outputs as conditioning signals, DiCoW simplifies the workflow for multi-speaker ASR, improves generalization to unseen speakers, and enables more reliable transcription in real-world multi-speaker recordings. Additionally, we explore the integration of a connectionist temporal classification (CTC) head into Whisper and demonstrate its ability to improve transcription efficiency through hybrid decoding. Notably, we show that our approach is not limited to Whisper; it also provides similar benefits when applied to the Branchformer model. We validate DiCoW on real-world datasets, including AMI and NOTSOFAR-1 from the CHiME-8 challenge, as well as synthetic benchmarks such as Libri2Mix and LibriCSS, enabling direct comparisons with previous methods. Results demonstrate that DiCoW enhances the model's target-speaker ASR capabilities while maintaining Whisper's accuracy and robustness on single-speaker data.
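The core idea behind the frame-level diarization-dependent transformations (FDDT) mentioned in the abstract can be sketched roughly as follows: each frame of the encoder's hidden states is passed through a mixture of per-class affine transformations, weighted by diarization-derived class probabilities for that frame (e.g. silence, target, non-target, overlap). This is only an illustrative sketch under those assumptions, not the paper's implementation; the function name `fddt` and all shapes are hypothetical.

```python
import numpy as np

def fddt(hidden, stno, weights, biases):
    """Illustrative sketch of frame-level diarization-dependent
    transformations (FDDT); names and shapes are assumptions.

    hidden:  (frames, dim)         encoder hidden states
    stno:    (frames, n_classes)   per-frame diarization class probabilities
                                   (e.g. silence / target / non-target / overlap)
    weights: (n_classes, dim, dim) one learned linear transform per class
    biases:  (n_classes, dim)      one learned bias per class
    """
    # Apply every class transform to every frame: (frames, n_classes, dim).
    transformed = np.einsum('td,cde->tce', hidden, weights) + biases
    # Mix the per-class results by the diarization probabilities: (frames, dim).
    return np.einsum('tc,tcd->td', stno, transformed)

# Usage sketch: with transforms initialized to identity and zero bias, the
# layer is a no-op, so the conditioned model starts out behaving like the
# original single-speaker encoder.
T, C, D = 6, 4, 8
hidden = np.random.randn(T, D)
stno = np.full((T, C), 0.25)            # uniform class probabilities
weights = np.stack([np.eye(D)] * C)     # identity init
biases = np.zeros((C, D))
out = fddt(hidden, stno, weights, biases)
```

Under the identity initialization shown, `out` equals `hidden` exactly, since the class probabilities sum to one per frame.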

Published
2025 (in print)
Pages
1–39
Journal
COMPUTER SPEECH AND LANGUAGE, ISSN 0885-2308
BibTeX
@article{BUT198052,
  author="Alexander {Polok} and Dominik {Klement} and Martin {Kocour} and Jiangyu {Han} and Federico Nicolás {Landini} and Bolaji {Yusuf} and Matthew {Wiesner} and Sanjeev {Khudanpur} and Jan {Černocký} and Lukáš {Burget}",
  title="DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition",
  journal="COMPUTER SPEECH AND LANGUAGE",
  year="2025",
  pages="1--39",
  issn="0885-2308",
  url="https://www.fit.vut.cz/research/publication/13524/"
}