Result detail

Factorized RVQ-GAN For Disentangled Speech Tokenization

KHURANA, S.; KLEMENT, D.; LAURENT, A.; BOBOS, D.; NOVOSAD, J.; GAZDIK, P.; ZHANG, E.; HUANG, Z.; HUSSEIN, A.; MARXER, R.; MASUYAMA, Y.; AIHARA, R.; HORI, C.; GERMAIN, F.; WICHERN, G.; LE ROUX, J. Factorized RVQ-GAN For Disentangled Speech Tokenization. In Proceedings of the Annual Conference of the International Speech Communication Association Interspeech. Interspeech. Rotterdam, The Netherlands: International Speech Communication Association, 2025. p. 3514-3518.
Type
conference proceedings paper
Language
English
Authors
Khurana Sameer
Klement Dominik, Ing., FIT (FIT), UPGM (FIT)
Laurent Antoine
Bobos Dominik
Novosad Juraj
Gazdik Peter
Zhang Ellen
Huang Zili
Hussein Amir
Marxer Ricard
Masuyama Yoshiki
Aihara Ryo
Hori Chiori
Germain François G.
Wichern Gordon
Le Roux Jonathan
Abstract

We propose Hierarchical Audio Codec (HAC), a unified neural speech codec that factorizes its bottleneck into three linguistic levels (acoustic, phonetic, and lexical) within a single model. HAC leverages two knowledge distillation objectives: one from a pre-trained speech encoder (HuBERT) for phoneme-level structure, and another from a text-based encoder (LaBSE) for lexical cues. Experiments on English and multilingual data show that HAC's factorized bottleneck yields disentangled token sets: one aligns with phonemes, while another captures word-level semantics. Quantitative evaluations confirm that HAC tokens preserve naturalness and provide interpretable linguistic information, outperforming single-level baselines in both disentanglement and reconstruction quality. These findings underscore HAC's potential as a unified discrete speech representation, bridging acoustic detail and lexical meaning for downstream speech generation and understanding tasks.
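
For readers unfamiliar with the "RVQ" in the title, the following is a minimal, generic sketch of residual vector quantization, in which each stage quantizes whatever the previous stage left unexplained. It is not the HAC implementation: the toy codebooks, the input vector, and the function names `nearest`, `rvq_encode`, and `rvq_decode` are all assumptions made for illustration, and the distillation objectives described in the abstract are not modeled.

```python
# Generic residual vector quantization (RVQ) sketch; NOT the HAC implementation.
# Codebooks and the input vector are made-up toy values (assumptions).

def nearest(codebook, vector):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(codebooks, vector):
    """Quantize `vector` stage by stage; each stage encodes the previous residual."""
    residual = list(vector)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen entry so the next stage sees only the remaining error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual

def rvq_decode(codebooks, tokens):
    """Reconstruct the vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, tokens):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Three toy stages, ordered coarse to fine (hypothetical values).
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5]],
    [[0.0, 0.0], [0.1, 0.1], [-0.1, 0.1]],
]

tokens, residual = rvq_encode(CODEBOOKS, [1.4, 1.1])
reconstruction = rvq_decode(CODEBOOKS, tokens)
```

Each stage emits one token, so a three-stage quantizer produces three token streams per frame. How HAC assigns such streams to its acoustic, phonetic, and lexical levels is described in the paper itself, not in this sketch.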

Keywords

Audio Codec | GAN | RVQ | Speech Tokenization

URL
https://www.isca-archive.org/interspeech_2025/khurana25_interspeech.pdf
Year
2025
Pages
3514–3518
Journal
Interspeech, ISSN
Proceedings
Proceedings of the Annual Conference of the International Speech Communication Association Interspeech
Conference
Interspeech Conference
Publisher
International Speech Communication Association
Place
Rotterdam, The Netherlands
DOI
10.21437/Interspeech.2025-2612
EID Scopus
BibTeX
@inproceedings{BUT199387,
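  author="Sameer {Khurana} and Dominik {Klement} and Antoine {Laurent} and Dominik {Bobos} and Juraj {Novosad} and Peter {Gazdik} and Ellen {Zhang} and Zili {Huang} and Amir {Hussein} and Ricard {Marxer} and Yoshiki {Masuyama} and Ryo {Aihara} and Chiori {Hori} and François G. {Germain} and Gordon {Wichern} and Jonathan {Le Roux}",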
  title="Factorized RVQ-GAN For Disentangled Speech Tokenization",
  booktitle="Proceedings of the Annual Conference of the International Speech Communication Association Interspeech",
  year="2025",
  journal="Interspeech",
  pages="3514--3518",
  publisher="International Speech Communication Association",
  address="Rotterdam, The Netherlands",
  doi="10.21437/Interspeech.2025-2612",
  url="https://www.isca-archive.org/interspeech_2025/khurana25_interspeech.pdf"
}
Projects
Departments