Publication Details
Multi-Speaker and Wide-Band Simulated Conversations as Training Data for End-to-End Neural Diarization
Landini Federico Nicolás
Diez Sánchez Mireia, M.Sc., Ph.D. (DCGM)
Lozano Díez Alicia, Ph.D.
Burget Lukáš, doc. Ing., Ph.D. (DCGM)
Speaker diarization, end-to-end neural diarization, simulated conversations
End-to-end diarization presents an attractive alternative to standard cascaded
diarization systems because a single system can handle all aspects of the task at
once. Many flavors of end-to-end models have been proposed, but all of them
require large amounts of annotated training data, which do not yet exist. The
compromise solution consists of generating synthetic data, and the recently
proposed simulated conversations (SC) have shown remarkable improvements over the
original simulated mixtures (SM). In this work, we create SC with multiple
speakers per conversation and show that they allow for substantially better
performance than SM while also reducing the dependence on a fine-tuning stage. We also
create SC with wide-band public audio sources and present an analysis of several
evaluation sets. Together with this publication, we release the recipes for
generating such data, models trained on public sets, and the
implementation for efficiently handling multiple speakers per conversation and an
auxiliary voice activity detection loss.
@inproceedings{BUT185197,
author="Federico Nicolás {Landini} and Mireia {Diez Sánchez} and Alicia {Lozano Díez} and Lukáš {Burget}",
title="Multi-Speaker and Wide-Band Simulated Conversations as Training Data for End-to-End Neural Diarization",
booktitle="Proceedings of ICASSP 2023",
year="2023",
pages="1--5",
publisher="IEEE Signal Processing Society",
address="Rhodes Island",
doi="10.1109/ICASSP49357.2023.10097049",
isbn="978-1-7281-6327-7",
url="https://ieeexplore.ieee.org/document/10097049"
}