Result Details
Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering
Carofilis Andres
Prakash Jeena
Kumar Shashi
Burdisso Sergio
Madikeri Srikanth
Villatoro-Tello Esau
Sharma Bidisha
Motlíček Petr, doc. Ing., Ph.D., DCGM (FIT)
Hacioglu Kadri
Venkatesan Shankar
Vyas Saurabh
Stolcke Andreas
Fine-tuning pretrained ASR models for specific domains is challenging for small organizations with limited labeled data and computational resources. We explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated by Whisper (encoder-decoder) and Zipformer (transducer) models. Our approach integrates multiple selection strategies, including word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis, to extract high-quality training segments. We evaluate our method on Whisper and Zipformer using a 7500-hour baseline, comparing it against a CER-based approach that relies on hypotheses from three ASR systems. Fine-tuning on 7500 hours of pseudo-labeled call center data yields 12.3% WER, while our filtering reduces the training set to just 100 hours (1.4%) with similar performance; a similar trend is observed on Fisher English.
speech recognition, data selection, whisper, zipformer
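The multi-stage filter described in the abstract combines several signals; as a concrete illustration, the Python sketch below implements one plausible stage, a CER-based agreement filter that keeps only segments where the Whisper and Zipformer hypotheses nearly coincide. This is a minimal sketch under assumptions, not the paper's implementation: the segment field names (whisper_hyp, zipformer_hyp) and the 0.1 threshold are hypothetical.

# CER-based agreement filter (illustrative sketch; field names and
# threshold are hypothetical, not taken from the paper).

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate of hyp measured against ref."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

def select_segments(segments, cer_threshold=0.1):
    """Keep segments whose two pseudo-labels agree closely: low
    inter-system CER serves as a proxy for transcript quality."""
    return [seg for seg in segments
            if cer(seg["whisper_hyp"], seg["zipformer_hyp"]) <= cer_threshold]

# Example: one segment with agreeing hypotheses, one without.
segments = [
    {"audio": "a.wav", "whisper_hyp": "thank you for calling",
     "zipformer_hyp": "thank you for calling"},
    {"audio": "b.wav", "whisper_hyp": "please hold the line",
     "zipformer_hyp": "police old the wine"},
]
print([s["audio"] for s in select_segments(segments)])  # -> ['a.wav']

Low inter-system CER is only one proxy for pseudo-label quality; per the abstract, the full pipeline additionally applies WER prediction and NER-based checks before a segment is accepted for fine-tuning.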
@inproceedings{BUT201433,
author="{} and {} and {} and {} and {} and {} and {} and {} and Petr {Motlíček} and {} and {} and {} and {}",
title="Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering",
booktitle="Interspeech",
year="2025",
journal="Interspeech",
pages="4928--4932",
publisher="Isca-Int Speech Communication Assoc",
address="Rotterdam, The Netherlands",
doi="10.21437/Interspeech.2025-2580",
url="https://www.fit.vut.cz/research/group/speech/public/publi/2025/rangappa_INTERSPEECH_2025_co-author_Motlicek.pdf"
}