Publication Details

Fine-Tuning Self-Supervised Models for Language Identification Using Orthonormal Constraint

PRASAD, A.; CAROFILIS, A.; VANDERREYDT, G.; KHALIL, D.; MADIKERI, S.; MOTLÍČEK, P.; SCHUEPBACH, C. Fine-Tuning Self-Supervised Models for Language Identification Using Orthonormal Constraint. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Seoul: IEEE Signal Processing Society, 2024. p. 11921-11925. ISBN: 979-8-3503-4485-1.
Czech title
Fine-Tuning samoučicích modelů pro identifikaci jazyka pomocí ortonormálního omezení
Type
conference paper
Language
English
Authors
Prasad Amrutha (DCGM)
CAROFILIS, A.
VANDERREYDT, G.
KHALIL, D.
Madikeri Srikanth
Motlíček Petr, doc. Ing., Ph.D. (DCGM)
SCHUEPBACH, C.
URL
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10446751
Keywords

Language Identification, Transformers, Wav2Vec2, fine-tuning, low-resource, out-of-domain

Abstract

Self-supervised models trained with high linguistic diversity, such as the XLS-R model, can be effectively fine-tuned for the language recognition task. Typically, a back-end classifier followed by a statistics pooling layer is added during training. Commonly used back-end classifiers require a large number of parameters to be trained, which is not ideal in limited data conditions. In this work, we explore smaller-parameter back-ends using the factorized Time Delay Neural Network (TDNN-F). The TDNN-F architecture is also integrated into Emphasized Channel Attention, Propagation and Aggregation TDNN (ECAPA-TDNN) models, termed ECAPA-TDNN-F, reducing the number of parameters by 30 to 50% absolute, with competitive accuracies and no change in minimum cost. The results show that ECAPA-TDNN-F can be extended to tasks where ECAPA-TDNN is suitable. We also test the effectiveness of a linear classifier and a variant, the Orthonormal linear classifier, previously used in x-vector-type systems. The models are trained with NIST LRE17 data and evaluated on the NIST LRE17, LRE22, and ATCO2 LID datasets. Both linear classifiers outperform conventional back-ends, with improvements in accuracy between 0.9% and 9.1%.
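Two of the building blocks named in the abstract are easy to illustrate: statistics pooling (concatenating per-dimension mean and standard deviation over time) and the semi-orthogonal constraint that TDNN-F factorized layers apply to one of their weight matrices. The sketch below is illustrative only, not the authors' implementation; function names and shapes are assumptions for the example.

```python
# Illustrative sketch (not the paper's code): statistics pooling and the
# semi-orthogonal Frobenius penalty used when training TDNN-F-style layers.
import numpy as np

def statistics_pooling(frames):
    """Pool a (T, D) sequence of frame embeddings into a fixed-size vector
    of length 2*D by concatenating the per-dimension mean and std over time."""
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return np.concatenate([mean, std])

def semi_orthogonal_penalty(M):
    """Penalty ||M M^T - I||_F^2 on a factorized weight matrix M
    (rows <= cols); minimizing it pushes the rows of M toward
    orthonormality, as in TDNN-F training."""
    gram = M @ M.T
    eye = np.eye(M.shape[0])
    return float(np.sum((gram - eye) ** 2))
```

A matrix with orthonormal rows (e.g. the transpose of a QR-factor `Q`) drives the penalty to zero, which is the fixed point the constraint steers training toward.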

Published
2024
Pages
11921–11925
Proceedings
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Conference
2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, KR
ISBN
979-8-3503-4485-1
Publisher
IEEE Signal Processing Society
Place
Seoul
DOI
10.1109/ICASSP48485.2024.10446751
BibTeX
@inproceedings{BUT193354,
  author="PRASAD, A. and CAROFILIS, A. and VANDERREYDT, G. and KHALIL, D. and MADIKERI, S. and MOTLÍČEK, P. and SCHUEPBACH, C.",
  title="Fine-Tuning Self-Supervised Models for Language Identification Using Orthonormal Constraint",
  booktitle="ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
  year="2024",
  pages="11921--11925",
  publisher="IEEE Signal Processing Society",
  address="Seoul",
  doi="10.1109/ICASSP48485.2024.10446751",
  isbn="979-8-3503-4485-1",
  url="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10446751"
}