Result Details

Multi-Sinkhorn Teacher Knowledge Aggregation Framework for Adaptive Audio Anti-Spoofing

ZHANG, R.; WEI, J.; LU, X.; ZHANG, L.; JIN, D.; LU, W.; XU, J. Multi-Sinkhorn Teacher Knowledge Aggregation Framework for Adaptive Audio Anti-Spoofing. IEEE Transactions on Audio, Speech, and Language Processing, 2025, vol. 33, pp. 3850–3865.
Type
journal article
Language
English
Authors
Zhang Ruiteng
Wei Jianguo
Lu Xugang
Zhang Lin, Ph.D.
Jin Di
Lu Wenhuan
Xu Junhai
Abstract

Audio anti-spoofing algorithms are widely deployed to defend against spoofing attacks, yet they often fail to detect unseen attacks. Although unsupervised domain adaptation (UDA) offers the potential to address this challenge, existing methods struggle with the large intra-class variability and complex distribution structures in target domains caused by the diversity of speech and attack types. In contrast, optimal transport (OT) leverages the geometric structure of intra-class distributions to measure discrepancies between probability distributions. The effectiveness of OT relies on the discriminability of data within target domains. However, in real-world scenarios involving multiple target domains, these domains often overlap in feature space, leading to the negative transport problem in OT. To overcome these domain mismatches in anti-spoofing, we propose the Multi-Sinkhorn Teacher Knowledge Aggregation (MSTKA) framework. Initially, to avoid interference between target domains during alignment, we use OT to adapt the source model to each target domain independently, thereby reducing negative transport. This adaptation involves constructing an OT cost matrix based on sentence-level representations of cross-domain samples and training an expert model for each target domain. Subsequently, we aggregate the knowledge from these expert models into a unified student model, enabling it to generalize across multiple target domains. Since spoofing cues could be distributed across different temporal scales, we align the student model's representations at multiple time scales with the teacher model's sentence-level representations to enhance the effectiveness of knowledge distillation. Multi-target adaptation experiments on eleven data sets demonstrate that our framework achieves state-of-the-art performance in audio anti-spoofing.
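The per-domain alignment step the abstract describes rests on entropy-regularized optimal transport solved with Sinkhorn iterations. As a minimal sketch (not the authors' implementation; the cost here is an illustrative squared-Euclidean distance between toy sentence-level embeddings, and `reg`, `n_iters` are assumed hyperparameters), the coupling between source and target representations can be computed as:

```python
import numpy as np

def sinkhorn(cost, a, b, reg=0.1, n_iters=200):
    """Entropy-regularized OT coupling via Sinkhorn matrix scaling.

    cost : (n, m) pairwise cost matrix between source and target samples
    a, b : marginal weights over source / target samples (each sums to 1)
    Returns the (n, m) transport plan P with row sums ~a and column sums ~b.
    """
    K = np.exp(-cost / reg)           # Gibbs kernel of the cost
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)             # rescale columns toward marginal b
        u = a / (K @ v)               # rescale rows toward marginal a
    return u[:, None] * K * v[None, :]

# Toy example: couple 4 "source" with 5 "target" sentence-level embeddings
rng = np.random.default_rng(0)
src = rng.normal(size=(4, 8))
tgt = rng.normal(size=(5, 8))
cost = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(-1)
cost = cost / cost.max()              # normalize so exp(-cost/reg) stays stable
a = np.full(4, 1 / 4)                 # uniform marginals over samples
b = np.full(5, 1 / 5)
P = sinkhorn(cost, a, b)
print(P.sum())                        # ~1.0: P is a joint distribution
```

In the paper's setting, one such coupling would be solved per target domain, giving each expert model its own alignment objective and avoiding the cross-domain interference (negative transport) that a single joint alignment can cause.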

Keywords

Adaptation models, Training, Computational modeling, Feature extraction, Couplings, Costs, Probability distribution, Data models, Speech recognition, Speech processing, Audio anti-spoofing, unsupervised domain adaptation, optimal transport, knowledge distillation

URL
https://ieeexplore.ieee.org/abstract/document/11150711
Published
2025
Pages
3850–3865
Journal
IEEE Transactions on Audio, Speech, and Language Processing, vol. 33, ISSN 1558-7916
DOI
10.1109/TASLPRO.2025.3606191
UT WoS
001579024300004
BibTeX
@article{BUT199981,
  author="Ruiteng {Zhang} and Jianguo {Wei} and Xugang {Lu} and Lin {Zhang} and Di {Jin} and Wenhuan {Lu} and Junhai {Xu}",
  title="Multi-Sinkhorn Teacher Knowledge Aggregation Framework for Adaptive Audio Anti-Spoofing",
  journal="IEEE Transactions on Audio, Speech, and Language Processing",
  year="2025",
  volume="33",
  pages="3850--3865",
  doi="10.1109/TASLPRO.2025.3606191",
  issn="1558-7916",
  url="https://ieeexplore.ieee.org/abstract/document/11150711"
}
Projects
Contemporary Methods of Processing, Analysis and Visualization of Multimedia and 3D Data, BUT, BUT internal projects, FIT-S-23-8278, start: 2023-03-01, end: 2026-02-28, running