How Does Pre-Trained Wav2Vec 2.0 Perform on Domain-Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications

ZULUAGA-GOMEZ, J.; PRASAD, A.; NIGMATULINA, I.; SARFJOO, S.; MOTLÍČEK, P.; KLEINERT, M.; HELMKE, H.; OHNEISER, O.; ZHAN, Q. How Does Pre-Trained Wav2Vec 2.0 Perform on Domain-Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications. In IEEE Spoken Language Technology Workshop, SLT 2022 - Proceedings. Doha: IEEE Signal Processing Society, 2023. p. 205-212. ISBN: 978-1-6654-7189-3.

Type

conference paper

Language

English

Authors

ZULUAGA-GOMEZ, J.
Prasad Amrutha, DCGM (FIT)
NIGMATULINA, I.
Sarfjoo Seyyed Saeed
Motlíček Petr, doc. Ing., Ph.D., DCGM (FIT)
KLEINERT, M.
HELMKE, H.
OHNEISER, O.
ZHAN, Q.

Abstract

Recent work on self-supervised pre-training focuses on leveraging large-scale unlabeled speech data to build robust end-to-end (E2E) acoustic models (AM) that can later be fine-tuned on downstream tasks, e.g., automatic speech recognition (ASR). Yet few works have investigated the impact on performance when the data properties differ substantially between the pre-training and fine-tuning phases, termed domain shift. We target this scenario by analyzing the robustness of Wav2Vec 2.0 and XLS-R models on downstream ASR for a completely unseen domain, air traffic control (ATC) communications. We benchmark these two models on several open-source and challenging ATC databases with signal-to-noise ratios between 5 and 20 dB. Relative word error rate (WER) reductions between 20% and 40% are obtained in comparison to hybrid-based ASR baselines by only fine-tuning E2E acoustic models with a smaller fraction of labeled data. We analyze WERs in the low-resource scenario and the gender bias carried by one ATC dataset.
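
The abstract reports its results as relative WER reductions. As a minimal sketch of how such a figure is commonly computed (not the authors' evaluation code), the Python below derives WER from a word-level edit distance and then the relative reduction between two systems. The function names, the example transcripts, and the WER values 0.30 and 0.21 are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution and one deletion in 4 words.
    print(word_error_rate("a b c d", "a x c"))
    # Hypothetical WERs for a baseline and a fine-tuned model.
    print(relative_wer_reduction(0.30, 0.21))
```

Under this definition, a drop from a hypothetical baseline WER of 30% to 21% is a relative reduction of about 30%, which is how a range such as "between 20% and 40%" should be read: it is relative to the baseline, not a difference in absolute percentage points.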

Keywords

Automatic speech recognition, Wav2Vec 2.0, self-supervised pre-training, air traffic control communications.

URL

https://ieeexplore.ieee.org/document/10022724

Published

2023

Pages

205–212

Proceedings

IEEE Spoken Language Technology Workshop, SLT 2022 - Proceedings

Conference

IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT

ISBN

978-1-6654-7189-3

Publisher

IEEE Signal Processing Society

Place

Doha

DOI

10.1109/SLT54892.2023.10022724

UT WoS

000968851900028

EID Scopus

2-s2.0-85141659819

BibTeX

@inproceedings{BUT185194,
  author="ZULUAGA-GOMEZ, J. and PRASAD, A. and NIGMATULINA, I. and SARFJOO, S. and MOTLÍČEK, P. and KLEINERT, M. and HELMKE, H. and OHNEISER, O. and ZHAN, Q.",
  title="How Does Pre-Trained Wav2Vec 2.0 Perform on Domain-Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications",
  booktitle="IEEE Spoken Language Technology Workshop, SLT 2022 - Proceedings",
  year="2023",
  pages="205--212",
  publisher="IEEE Signal Processing Society",
  address="Doha",
  doi="10.1109/SLT54892.2023.10022724",
  isbn="978-1-6654-7189-3",
  url="https://ieeexplore.ieee.org/document/10022724"
}

Files

pdf zulaga-gomez_amrutha prasad_slt_2023_10022724.pdf 281 kB

Projects

Automatic collection and processing of voice data from air-traffic communications, EU, Horizon 2020, start: 2019-11-01, end: 2022-02-28, completed

Research groups

Speech Data Mining Research Group BUT Speech@FIT (RG SPEECH)

Departments

Department of Computer Graphics and Multimedia (DCGM)