Result Details

Fine-Tuning Self-Supervised Models for Language Identification Using Orthonormal Constraint

PRASAD, A.; CAROFILIS, A.; VANDERREYDT, G.; KHALIL, D.; MADIKERI, S.; MOTLÍČEK, P.; SCHUEPBACH, C. Fine-Tuning Self-Supervised Models for Language Identification Using Orthonormal Constraint. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Seoul: IEEE Signal Processing Society, 2024. p. 11921-11925. ISBN: 979-8-3503-4485-1.
Type
conference paper
Language
English
Authors
Prasad Amrutha
CAROFILIS, A.
VANDERREYDT, G.
KHALIL, D.
Madikeri Srikanth, FIT (FIT)
Motlíček Petr, doc. Ing., Ph.D., DCGM (FIT)
SCHUEPBACH, C.
Abstract

Self-supervised models trained with high linguistic diversity, such as the XLS-R model, can be effectively fine-tuned for the language recognition task. Typically, a back-end classifier followed by a statistics pooling layer is added during training. Commonly used back-end classifiers require a large number of parameters to be trained, which is not ideal in limited-data conditions. In this work, we explore smaller-parameter back-ends using the factorized Time Delay Neural Network (TDNN-F). The TDNN-F architecture is also integrated into Emphasized Channel Attention, Propagation and Aggregation TDNN (ECAPA-TDNN) models, termed ECAPA-TDNN-F, reducing the number of parameters by 30 to 50% absolute, with competitive accuracies and no change in minimum cost. The results show that the ECAPA-TDNN-F can be extended to tasks where ECAPA-TDNN is suitable. We also test the effectiveness of a linear classifier and a variant, the orthonormal linear classifier, previously used in x-vector type systems. The models are trained with NIST LRE17 data and evaluated on the NIST LRE17, LRE22 and ATCO2 LID datasets. Both linear classifiers outperform conventional back-ends, with improvements in accuracy between 0.9% and 9.1%.
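As an illustration of the parameter reduction described in the abstract, the sketch below shows one way a TDNN-F-style factorized layer can be written: a dense transform is split into two smaller factors through a low-rank bottleneck, with the first factor pushed towards semi-orthonormality. PyTorch, the class name, the layer sizes, and the penalty formulation are assumptions made for illustration; Kaldi-style TDNN-F enforces the constraint with a periodic direct weight update rather than a loss penalty, and the paper's exact ECAPA-TDNN-F configuration is not reproduced here.

# Hypothetical sketch (PyTorch assumed) of the TDNN-F factorization idea.
import torch
import torch.nn as nn

class FactorizedLinear(nn.Module):
    """W (out x in) approximated as A @ B, with B kept close to semi-orthonormal (B B^T ~ I)."""
    def __init__(self, in_dim=512, out_dim=512, bottleneck=128):
        super().__init__()
        self.B = nn.Linear(in_dim, bottleneck, bias=False)   # constrained factor
        self.A = nn.Linear(bottleneck, out_dim)

    def forward(self, x):
        return self.A(self.B(x))

    def semi_orthonormal_penalty(self):
        # Frobenius-norm penalty ||B B^T - I||^2, added to the training loss.
        M = self.B.weight                                   # (bottleneck, in_dim)
        gram = M @ M.t()                                    # (bottleneck, bottleneck)
        eye = torch.eye(gram.size(0), device=M.device)
        return ((gram - eye) ** 2).sum()

With these illustrative sizes, a dense 512x512 layer holds about 262k weights, while the factorized version holds about 131k (512*128 + 128*512), i.e. roughly a 50% reduction, in line with the 30 to 50% figure quoted in the abstract.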
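Similarly, a minimal sketch of the statistics pooling plus orthonormal linear classifier set-up is given below, again assuming PyTorch. The class names, feature dimension, number of target languages, and penalty weight are hypothetical; the code only illustrates a ||W W^T - I|| style constraint on a linear back-end over pooled frame-level features, not the paper's exact training recipe.

# Hypothetical sketch (PyTorch assumed): statistics pooling over frame-level
# XLS-R features followed by a linear language-ID classifier whose weights
# are pushed towards orthonormality via a Frobenius-norm penalty.
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Concatenate per-utterance mean and standard deviation of frame-level features."""
    def forward(self, x):                      # x: (batch, frames, feat_dim)
        mean = x.mean(dim=1)
        std = x.std(dim=1)
        return torch.cat([mean, std], dim=1)   # (batch, 2 * feat_dim)

class OrthonormalLinearLID(nn.Module):
    """Linear back-end for language identification with an orthonormal weight penalty."""
    def __init__(self, feat_dim=1024, num_langs=14):
        super().__init__()
        self.pool = StatsPooling()
        self.classifier = nn.Linear(2 * feat_dim, num_langs)

    def forward(self, frame_feats):
        return self.classifier(self.pool(frame_feats))

    def orthonormal_penalty(self):
        W = self.classifier.weight                     # (num_langs, 2 * feat_dim)
        gram = W @ W.t()                               # (num_langs, num_langs)
        eye = torch.eye(gram.size(0), device=W.device)
        return ((gram - eye) ** 2).sum()

# Usage sketch: add the penalty to the cross-entropy loss during fine-tuning.
model = OrthonormalLinearLID()
feats = torch.randn(8, 200, 1024)        # stand-in for XLS-R frame-level outputs
labels = torch.randint(0, 14, (8,))
logits = model(feats)
loss = nn.functional.cross_entropy(logits, labels) + 1e-3 * model.orthonormal_penalty()
loss.backward()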

Keywords

Language Identification, Transformers, Wav2Vec2, fine-tuning, low-resource, out-of-domain

URL
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10446751
Published
2024
Pages
11921–11925
Proceedings
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Conference
2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
ISBN
979-8-3503-4485-1
Publisher
IEEE Signal Processing Society
Place
Seoul
DOI
10.1109/ICASSP48485.2024.10446751
BibTeX
@inproceedings{BUT193354,
  author="PRASAD, A. and CAROFILIS, A. and VANDERREYDT, G. and KHALIL, D. and MADIKERI, S. and MOTLÍČEK, P. and SCHUEPBACH, C.",
  title="Fine-Tuning Self-Supervised Models for Language Identification Using Orthonormal Constraint",
  booktitle="ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
  year="2024",
  pages="11921--11925",
  publisher="IEEE Signal Processing Society",
  address="Seoul",
  doi="10.1109/ICASSP48485.2024.10446751",
  isbn="979-8-3503-4485-1",
  url="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10446751"
}
Projects
Contemporary methods for processing, analysis and visualization of multimedia and 3D data, BUT, BUT internal projects, FIT-S-23-8278, start: 2023-03-01, end: 2026-02-28, running