Result detail

Comparison of wav2vec 2.0 models on three speech processing tasks

KUNEŠOVÁ, M.; ZAJÍC, Z.; ŠMÍDL, L.; KARAFIÁT, M. Comparison of wav2vec 2.0 models on three speech processing tasks. International Journal of Speech Technology, 2024, vol. 27, no. 4, p. 847-859. ISSN: 1572-8110.

Type

journal article

Language

English

Authors

Zajíc Zbyněk, Ing., Ph.D.
Šmídl Luboš, Ing., Ph.D.
Karafiát Martin, Ing., Ph.D., UPGM (FIT)
Kunešová Marie, Ing., Ph.D.

Abstract

The current state-of-the-art for various speech processing problems is a sequence-to-sequence model based on a self-attention
mechanism known as transformer. The widely used wav2vec 2.0 is a self-supervised transformer model pre-trained on large
amounts of unlabeled speech and then fine-tuned for a specific task. The data used for training and fine-tuning, along with
the size of the transformer model, play a crucial role in both of these training steps. The most commonly used wav2vec 2.0
models are trained on relatively "clean" data from sources such as the LibriSpeech dataset, but we can expect there to be a
benefit in using more realistic data gathered from a variety of acoustic conditions. However, it is not entirely clear how big
the difference would be. Investigating this is the main goal of our article. To this end, we utilize wav2vec 2.0 models in three
fundamental speech processing tasks: speaker change detection, voice activity detection, and overlapped speech detection,
and test them on four real conversation datasets. We compare four wav2vec 2.0 models with different sizes and different
data used for pre-training, and we fine-tune them either on in-domain data from the same dataset or on artificial training
data created from the LibriSpeech corpus. Our results suggest that richer data that are more similar to the task domain bring
better performance than a larger model.

Keywords

Speaker change detection, Voice activity detection, Overlapped speech detection, Wav2vec 2.0

URL

https://link.springer.com/article/10.1007/s10772-024-10140-6

Year

2024

Pages

847–859

Journal

International Journal of Speech Technology, vol. 27, no. 4, ISSN 1572-8110

DOI

10.1007/s10772-024-10140-6

EID Scopus

2-s2.0-85206375991

BibTeX

@article{BUT193586,
  author="Zbyněk {Zajíc} and Luboš {Šmídl} and Martin {Karafiát} and Marie {Kunešová}",
  title="Comparison of wav2vec 2.0 models on three speech processing tasks",
  journal="International Journal of Speech Technology",
  year="2024",
  volume="27",
  number="4",
  pages="847--859",
  doi="10.1007/s10772-024-10140-6",
  issn="1572-8110",
  url="https://link.springer.com/article/10.1007/s10772-024-10140-6"
}

Files

pdf kunesova_springer_2024_s10772-024-10140-6.pdf 1 MB

Projects

Robust processing of recordings for operations and security, Ministry of the Interior (MV), Programme for Strategic Support of the Development of Security Research in the Czech Republic 2019-2025 (IMPAKT 1), Subprogramme 1: Joint Research Projects (BV IMP1/1VS), VJ01010108, start: 2020-10-01, end: 2025-09-30, completed

Research groups

Speech Data Mining Research Group BUT Speech@FIT (VZ SPEECH)

Department

Department of Computer Graphics and Multimedia (UPGM)