Publication Details
CA-MHFA: A Context-Aware Multi-Head Factorized Attentive Pooling for SSL-Based Speaker Verification
Peng Junyi
Mošner Ladislav, Ing. (DCGM)
Zhang Lin, Ph.D.
Plchot Oldřich, Ing., Ph.D. (DCGM)
Stafylakis Themos
Burget Lukáš, doc. Ing., Ph.D. (DCGM)
Černocký Jan, prof. Dr. Ing. (DCGM)
Self-supervised learning, speaker verification, speaker extractor, pooling mechanism, speech classification
Self-supervised learning (SSL) models for speaker verification (SV) have gained significant attention in recent years. However, existing SSL-based SV systems often struggle to capture local temporal dependencies and generalize across different tasks. In this paper, we propose context-aware multi-head factorized attentive pooling (CA-MHFA), a lightweight framework that incorporates contextual information from surrounding frames. CA-MHFA leverages grouped, learnable queries to effectively model contextual dependencies while maintaining efficiency by sharing keys and values across groups. Experimental results on the VoxCeleb dataset show that CA-MHFA achieves EERs of 0.42%, 0.48%, and 0.96% on Vox1-O, Vox1-E, and Vox1-H, respectively, outperforming complex models like WavLM-TDNN with fewer parameters and faster convergence. Additionally, CA-MHFA demonstrates strong generalization across multiple SSL models and tasks, including emotion recognition and anti-spoofing, highlighting its robustness and versatility.
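
To illustrate the kind of pooling the abstract describes, below is a minimal PyTorch sketch of a factorized attentive pooling layer with grouped learnable queries, shared keys/values, and a simple frame-context mechanism. The layer names, dimensions, and the use of a depthwise convolution for context are illustrative assumptions, not the paper's released implementation.

# Sketch of an MHFA-style pooling layer with grouped queries and a
# context window over surrounding frames. Hyperparameters and the exact
# context/grouping formulation are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareMHFAPooling(nn.Module):
    def __init__(self, num_layers=13, feat_dim=768, comp_dim=128,
                 num_heads=32, num_groups=4, context=3, emb_dim=256):
        super().__init__()
        assert num_heads % num_groups == 0
        # Learnable weights over SSL transformer layers, one set for keys
        # and one for values (factorized attentive pooling).
        self.layer_w_k = nn.Parameter(torch.zeros(num_layers))
        self.layer_w_v = nn.Parameter(torch.zeros(num_layers))
        self.compress_k = nn.Linear(feat_dim, comp_dim)
        self.compress_v = nn.Linear(feat_dim, comp_dim)
        # Depthwise 1-D convolution over time injects information from
        # surrounding frames into the keys (the "context-aware" part, assumed).
        self.context_conv = nn.Conv1d(comp_dim, comp_dim,
                                      kernel_size=2 * context + 1,
                                      padding=context, groups=comp_dim)
        # Grouped learnable queries: each group owns its queries, while keys
        # and values are shared across groups for efficiency.
        self.queries = nn.Parameter(
            torch.randn(num_groups, num_heads // num_groups, comp_dim))
        self.out_proj = nn.Linear(num_heads * comp_dim, emb_dim)

    def forward(self, layer_feats):
        # layer_feats: (batch, num_layers, time, feat_dim) stacked SSL outputs
        w_k = F.softmax(self.layer_w_k, dim=0).view(1, -1, 1, 1)
        w_v = F.softmax(self.layer_w_v, dim=0).view(1, -1, 1, 1)
        k = self.compress_k((layer_feats * w_k).sum(dim=1))   # (B, T, D)
        v = self.compress_v((layer_feats * w_v).sum(dim=1))   # (B, T, D)
        k = self.context_conv(k.transpose(1, 2)).transpose(1, 2)
        # Per-head attention scores: all grouped queries against shared keys.
        q = self.queries.reshape(-1, k.size(-1))              # (H, D)
        scores = torch.einsum("btd,hd->bth", k, q)             # (B, T, H)
        attn = F.softmax(scores, dim=1)
        # Weighted sum of shared values per head, then concatenate heads.
        pooled = torch.einsum("bth,btd->bhd", attn, v)         # (B, H, D)
        return self.out_proj(pooled.flatten(1))                # (B, emb_dim)

# Example: pool 13 stacked WavLM-Base layer outputs for a 2-utterance batch.
feats = torch.randn(2, 13, 200, 768)
emb = ContextAwareMHFAPooling()(feats)
print(emb.shape)  # torch.Size([2, 256])
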
@inproceedings{BUT198050,
author="Junyi {Peng} and Ladislav {Mošner} and Lin {Zhang} and Oldřich {Plchot} and Themos {Stafylakis} and Lukáš {Burget} and Jan {Černocký}",
title="CA-MHFA: A Context-Aware Multi-Head Factorized Attentive Pooling for SSL-Based Speaker Verification",
booktitle="Proceedings of ICASSP 2025",
year="2025",
pages="1--5",
publisher="IEEE Biometric Council",
address="Hyderabad",
doi="10.1109/ICASSP49660.2025.10889058",
isbn="979-8-3503-6874-1",
url="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10889058"
}