Publication Details

Effectiveness of Text, Acoustic, and Lattice-Based Representations in Spoken Language Understanding Tasks

VILLATORO-TELLO Esaú, MADIKERI Srikanth, ZULUAGA-GOMEZ Juan, SHARMA Bidisha, SARFJOO Seyyed Saeed, NIGMATULINA Iuliia, MOTLÍČEK Petr, IVANOV Alexei V. and GANAPATHIRAJU Aravind. Effectiveness of Text, Acoustic, and Lattice-Based Representations in Spoken Language Understanding Tasks. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Rhodes Island: IEEE Signal Processing Society, 2023, pp. 1-5. ISBN 978-1-7281-6327-7. Available from: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10095168
Czech title
Efektivita textové, akustické a mřížkové reprezentace v úlohách porozumění mluvené řeči
Type
conference paper
Language
english
Authors
Villatoro-Tello Esaú (IDIAP)
Madikeri Srikanth (IDIAP)
Zuluaga-Gomez Juan (IDIAP)
Sharma Bidisha ()
Sarfjoo Seyyed Saeed (IDIAP)
Nigmatulina Iuliia (IDIAP)
Motlíček Petr, doc. Ing., Ph.D. (DCGM FIT BUT)
Ivanov Alexei V. ()
Ganapathiraju Aravind ()
URL
Keywords

Speech Recognition, Human-computer Interaction, Spoken Language Understanding, Word Consensus Networks, Cross-modal Attention

Abstract

In this paper, we perform an exhaustive evaluation of different representations to address the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems to perform the SLU intent detection task: 1) text-based, 2) lattice-based, and a novel 3) multimodal approach. Our work provides a comprehensive analysis of the achievable performance of different state-of-the-art SLU systems under different circumstances, e.g., automatically vs. manually generated transcripts. We evaluate the systems on the publicly available SLURP spoken language resource corpus. Our results indicate that using richer forms of Automatic Speech Recognition (ASR) outputs, namely word consensus networks, allows the SLU system to improve over the 1-best setup (5.5% relative improvement). However, cross-modal approaches, i.e., learning from acoustic and text embeddings, obtain performance similar to the oracle setup, a relative improvement of 17.8% over the 1-best configuration, making them a recommended alternative for overcoming the limitations of working with automatically generated transcripts.
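The cross-modal system described above fuses acoustic and text embeddings. As a rough illustration only (not the authors' implementation), the sketch below shows one common way to realize such fusion with cross-modal attention in PyTorch; the embedding dimensions, number of heads, pooling choice, and number of intent classes are assumptions.

```python
# Hypothetical cross-modal attention fusion block for intent classification.
# Dimensions, names, and pooling are illustrative assumptions, not the
# architecture reported in the paper.
import torch
import torch.nn as nn


class CrossModalIntentClassifier(nn.Module):
    def __init__(self, text_dim=768, audio_dim=512, hidden_dim=256, num_intents=60):
        super().__init__()
        # Project both modalities into a shared space before attention.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        # Text embeddings attend over acoustic embeddings (cross-modal attention).
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_intents)

    def forward(self, text_emb, audio_emb):
        # text_emb: (batch, n_tokens, text_dim); audio_emb: (batch, n_frames, audio_dim)
        q = self.text_proj(text_emb)
        kv = self.audio_proj(audio_emb)
        fused, _ = self.cross_attn(query=q, key=kv, value=kv)
        # Mean-pool the fused token representations and predict the intent.
        return self.classifier(fused.mean(dim=1))


# Toy usage with random tensors standing in for BERT-like text features and
# wav2vec-like acoustic features.
model = CrossModalIntentClassifier()
logits = model(torch.randn(2, 20, 768), torch.randn(2, 300, 512))
print(logits.shape)  # torch.Size([2, 60])
```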

Published
2023
Pages
1-5
Proceedings
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Conference
2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece
ISBN
978-1-7281-6327-7
Publisher
IEEE Signal Processing Society
Place
Rhodes Island, GR
DOI
10.1109/ICASSP49357.2023.10095168
EID Scopus
BibTeX
@INPROCEEDINGS{FITPUB13158,
   author = "Esa\'{u} Villatoro-Tello and Srikanth Madikeri and Juan Zuluaga-Gomez and Bidisha Sharma and Seyyed Saeed Sarfjoo and Iuliia Nigmatulina and Petr Motl\'{i}\v{c}ek and Alexei V. Ivanov and Aravind Ganapathiraju",
   title = "Effectiveness of Text, Acoustic, and Lattice-Based Representations in Spoken Language Understanding Tasks",
   pages = "1--5",
   booktitle = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
   year = 2023,
   location = "Rhodes Island, GR",
   publisher = "IEEE Signal Processing Society",
   ISBN = "978-1-7281-6327-7",
   doi = "10.1109/ICASSP49357.2023.10095168",
   language = "english",
   url = "https://www.fit.vut.cz/research/publication/13158"
}