Result Details

DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition

Created: 2024
Type
software
Language
English
Authors
Polok Alexander, Ing., DCGM (FIT)
Klement Dominik, Ing., FIT (FIT), DCGM (FIT)
Kocour Martin, Ing., DCGM (FIT)
Description

DiCoW (Diarization-Conditioned Whisper) is a Target Speaker Automatic Speech Recognition (TS-ASR) system that integrates speaker diarization cues into OpenAI's Whisper model. By conditioning on speaker identity, DiCoW enables accurate transcription of a target speaker's speech in complex multi-speaker environments.
At the time of publication, DiCoW achieved state-of-the-art performance on the Libri2Mix and AMI benchmarks. The system received the Jury Award in the CHiME-8 Task 2 (NOTSOFAR) challenge and the Best Reproducibility Award at the Challenge and Workshop on Multilingual Conversational Speech Language Model (MLC-SLM).

Keywords

Diarization, Conditioned Whisper, Target Speaker, Automatic Speech Recognition

URL
License
Any other entity must always acquire a license to use the result
License Fee
The licensor does not require a license fee for the result
Projects
Linguistics, Artificial Intelligence and Language and Speech Technologies: from Research to Applications, EU, MEZISEKTOROVÁ SPOLUPRÁCE, EH23_020/0008518, start: 2025-01-01, end: 2028-12-31, running