Result detail

Probing Self-Supervised Learning Models With Target Speech Extraction

PENG, J.; DELCROIX, M.; OCHIAI, T.; ASHIHARA, T.; PLCHOT, O.; ARAKI, S.; ČERNOCKÝ, J. Probing Self-Supervised Learning Models With Target Speech Extraction. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Seoul: IEEE Signal Processing Society, 2024. p. 535-539. ISBN: 979-8-3503-7451-3.
Type
conference paper
Language
English
Authors
Peng Junyi, UPGM (FIT)
Delcroix Marc, FIT (FIT)
OCHIAI, T.
ASHIHARA, T.
Plchot Oldřich, Ing., Ph.D., UPGM (FIT)
ARAKI, S.
Černocký Jan, prof. Dr. Ing., UPGM (FIT)
Abstract

Large-scale pre-trained self-supervised learning (SSL) models have shown remarkable advancements in speech-related tasks. However, the utilization of these models in complex multi-talker scenarios, such as extracting a target speaker from a mixture, is yet to be fully evaluated. In this paper, we introduce target speech extraction (TSE) as a novel downstream task to evaluate the feature extraction capabilities of pre-trained SSL models. TSE uniquely requires both speaker identification and speech separation, distinguishing it from other tasks in the Speech processing Universal PERformance Benchmark (SUPERB) evaluation. Specifically, we propose a TSE downstream model composed of two lightweight task-oriented modules based on the same frozen SSL model. One module functions as a speaker encoder to obtain target speaker information from an enrollment speech, while the other estimates the target speaker's mask to extract their speech from the mixture. Experimental results on the Libri2Mix dataset reveal the relevance of the TSE downstream task for probing SSL models, as its performance cannot be simply deduced from other related tasks such as speaker verification and separation.
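The two-module design described in the abstract — a speaker encoder and a mask estimator sharing one frozen SSL feature extractor — can be illustrated with a minimal sketch. Everything here (feature dimension, mean pooling, the similarity-plus-sigmoid mask) is an illustrative assumption, not the paper's actual configuration:

```python
# Hypothetical sketch of the TSE downstream architecture: two lightweight
# modules operate on features from the same frozen SSL model.
# Layer shapes and pooling/masking choices are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
D = 16  # assumed SSL feature dimension

def frozen_ssl(frames):
    """Stand-in for a frozen pre-trained SSL model: raw frames -> features."""
    W = np.ones((frames.shape[-1], D)) / frames.shape[-1]  # fixed weights
    return frames @ W  # (T, D)

def speaker_encoder(enroll_feats):
    """Module 1: pool enrollment features into a target-speaker embedding."""
    return enroll_feats.mean(axis=0)  # (D,)

def mask_estimator(mix_feats, spk_emb):
    """Module 2: estimate a per-frame mask for the target speaker,
    conditioned on the speaker embedding (simple similarity + sigmoid)."""
    scores = mix_feats @ spk_emb  # (T,)
    return 1.0 / (1.0 + np.exp(-scores))  # values in (0, 1)

enroll = rng.standard_normal((50, 160))    # enrollment utterance frames
mixture = rng.standard_normal((120, 160))  # two-speaker mixture frames

spk_emb = speaker_encoder(frozen_ssl(enroll))
mix_feats = frozen_ssl(mixture)
mask = mask_estimator(mix_feats, spk_emb)
extracted = mix_feats * mask[:, None]  # masked target-speaker features
print(mask.shape, extracted.shape)
```

In the actual downstream model only these two small modules are trained, which is what makes TSE a probe of the frozen SSL representations rather than of a large task-specific network.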

Keywords

Target speech extraction, self-supervised learning, SUPERB

URL
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10627502
Year
2024
Pages
535–539
Proceedings
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Conference
2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
ISBN
979-8-3503-7451-3
Publisher
IEEE Signal Processing Society
Place
Seoul
DOI
10.1109/ICASSPW62465.2024.10627502
BibTeX
@inproceedings{BUT189780,
  author="PENG, J. and DELCROIX, M. and OCHIAI, T. and ASHIHARA, T. and PLCHOT, O. and ARAKI, S. and ČERNOCKÝ, J.",
  title="Probing Self-Supervised Learning Models With Target Speech Extraction",
  booktitle="ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
  year="2024",
  pages="535--539",
  publisher="IEEE Signal Processing Society",
  address="Seoul",
  doi="10.1109/ICASSPW62465.2024.10627502",
  isbn="979-8-3503-7451-3",
  url="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10627502"
}
Projects
Neural Representations in Multimodal and Multilingual Modelling, GAČR, EXPRO Grant Projects of Excellence in Basic Research - 2019, GX19-26934X, start: 2019-01-01, end: 2023-12-31, completed