Result Details

Multi-Sinkhorn Teacher Knowledge Aggregation Framework for Adaptive Audio Anti-Spoofing

ZHANG, R.; WEI, J.; LU, X.; ZHANG, L.; JIN, D.; LU, W.; XU, J. Multi-Sinkhorn Teacher Knowledge Aggregation Framework for Adaptive Audio Anti-Spoofing. IEEE Transactions on Audio, Speech, and Language Processing, 2025, no. 33, p. 3850-3865.

Type

journal article

Language

English

Authors

Zhang Ruiteng
Wei Jianguo
Lu Xugang
Zhang Lin, Ph.D.
Jin Di
Lu Wenhuan
Xu Junhai

Abstract

Audio anti-spoofing algorithms are widely deployed to defend against spoofing attacks, yet they often fail to detect unseen attacks. Although unsupervised domain adaptation (UDA) offers the potential to address this challenge, existing methods struggle with the large intra-class variability and complex distribution structures in target domains caused by the diversity of speech and attack types. In contrast, optimal transport (OT) leverages the geometric structure of intra-class distributions to measure discrepancies between probability distributions. The effectiveness of OT relies on the discriminability of data within target domains. However, in real-world scenarios involving multiple target domains, these domains often overlap in feature space, leading to the negative transport problem in OT. To overcome these domain mismatches in anti-spoofing, we propose the Multi-Sinkhorn Teacher Knowledge Aggregation (MSTKA) framework. Initially, to avoid interference between target domains during alignment, we use OT to adapt the source model to each target domain independently, thereby reducing negative transport. This adaptation involves constructing an OT cost matrix based on sentence-level representations of cross-domain samples and training an expert model for each target domain. Subsequently, we aggregate the knowledge from these expert models into a unified student model, enabling it to generalize across multiple target domains. Since spoofing cues could be distributed across different temporal scales, we align the student model's representations at multiple time scales with the teacher model's sentence-level representations to enhance the effectiveness of knowledge distillation. Multi-target adaptation experiments on eleven data sets demonstrate that our framework achieves state-of-the-art performance in audio anti-spoofing.

Keywords

Adaptation models, Training, Computational modeling, Feature extraction, Couplings, Costs, Probability distribution, Data models, Speech recognition, Speech processing, Audio anti-spoofing, unsupervised domain adaptation, optimal transport, knowledge distillation

URL

https://ieeexplore.ieee.org/abstract/document/11150711

Published

2025

Pages

3850–3865

Journal

IEEE Transactions on Audio, Speech, and Language Processing, no. 33, ISSN 1558-7916

DOI

10.1109/TASLPRO.2025.3606191

UT WoS

001579024300004

BibTeX

@article{BUT199981,
  author="Ruiteng {Zhang} and Jianguo {Wei} and Xugang {Lu} and Lin {Zhang} and Di {Jin} and Wenhuan {Lu} and Junhai {Xu}",
  title="Multi-Sinkhorn Teacher Knowledge Aggregation Framework for Adaptive Audio Anti-Spoofing",
  journal="IEEE Transactions on Audio, Speech, and Language Processing",
  year="2025",
  number="33",
  pages="3850--3865",
  doi="10.1109/TASLPRO.2025.3606191",
  issn="1558-7916",
  url="https://ieeexplore.ieee.org/abstract/document/11150711"
}

Projects

Contemporary methods for processing, analysis and visualization of multimedia and 3D data, BUT, BUT internal projects, FIT-S-23-8278, start: 2023-03-01, end: 2026-02-28, running

Research groups

Speech Data Mining Research Group BUT Speech@FIT (RG SPEECH)

Departments

Department of Computer Graphics and Multimedia (DCGM)
Speech Data Mining Research Group BUT Speech@FIT (RG SPEECH)