Result Details

Effectiveness of Text, Acoustic, and Lattice-Based Representations in Spoken Language Understanding Tasks

VILLATORO-TELLO, E.; MADIKERI, S.; ZULUAGA-GOMEZ, J.; SHARMA, B.; SARFJOO, S.; NIGMATULINA, I.; MOTLÍČEK, P.; IVANOV, V.; GANAPATHIRAJU, A. Effectiveness of Text, Acoustic, and Lattice-Based Representations in Spoken Language Understanding Tasks. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Rhodes Island: IEEE Signal Processing Society, 2023. p. 1-5. ISBN: 978-1-7281-6327-7.
Type
conference paper
Language
English
Authors
VILLATORO-TELLO, E.
MADIKERI, S. (FIT)
ZULUAGA-GOMEZ, J.
SHARMA, B.
SARFJOO, S.
NIGMATULINA, I.
MOTLÍČEK, P. (DCGM, FIT)
IVANOV, V.
GANAPATHIRAJU, A.
Abstract

In this paper, we perform an exhaustive evaluation of different representations to address the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems for the SLU intent detection task: 1) text-based, 2) lattice-based, and a novel 3) multimodal approach. Our work provides a comprehensive analysis of the achievable performance of different state-of-the-art SLU systems under different circumstances, e.g., automatically vs. manually generated transcripts. We evaluate the systems on the publicly available SLURP spoken language resource corpus. Our results indicate that using richer forms of Automatic Speech Recognition (ASR) output, namely word consensus networks, allows the SLU system to improve over the 1-best setup (5.5% relative improvement). Moreover, cross-modal approaches, i.e., learning from acoustic and text embeddings, obtain performance similar to the oracle setup, a relative improvement of 17.8% over the 1-best configuration, making them a recommended alternative for overcoming the limitations of working with automatically generated transcripts.
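
For illustration only (this sketch is not taken from the paper): a minimal PyTorch example of the general idea behind cross-modal attention fusion of text and acoustic embeddings for utterance-level intent classification. The class name, layer sizes, and the intent count of 60 are assumptions, and precomputed text/acoustic embeddings are stood in by random tensors.

# Minimal, illustrative sketch (not the authors' implementation) of
# cross-modal attention fusion of text and acoustic embeddings for
# intent classification; dimensions and layer choices are assumptions.
import torch
import torch.nn as nn

class CrossModalIntentClassifier(nn.Module):
    def __init__(self, text_dim=768, audio_dim=512, hidden=256, n_intents=60):
        super().__init__()
        # Project both modalities into a shared space.
        self.text_proj = nn.Linear(text_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        # Text tokens attend over acoustic frames (cross-modal attention).
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_intents)

    def forward(self, text_emb, audio_emb):
        # text_emb:  (batch, n_tokens, text_dim),  e.g. transformer outputs
        # audio_emb: (batch, n_frames, audio_dim), e.g. speech-encoder features
        q = self.text_proj(text_emb)
        kv = self.audio_proj(audio_emb)
        fused, _ = self.cross_attn(q, kv, kv)  # text queries, audio keys/values
        # Pool both streams and classify the utterance-level intent.
        pooled = torch.cat([q.mean(dim=1), fused.mean(dim=1)], dim=-1)
        return self.classifier(pooled)

# Usage with random tensors standing in for real features.
model = CrossModalIntentClassifier()
logits = model(torch.randn(2, 20, 768), torch.randn(2, 200, 512))
print(logits.shape)  # torch.Size([2, 60])

In practice, the text embeddings would come from a pretrained language model applied to ASR 1-best or word-consensus-network inputs and the acoustic embeddings from a pretrained speech encoder; the sketch only shows how the two streams could be combined.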

Keywords

Speech Recognition, Human-computer Interaction, Spoken Language Understanding, Word Consensus Networks, Cross-modal Attention

URL
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10095168
Published
2023
Pages
1–5
Proceedings
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Conference
2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
ISBN
978-1-7281-6327-7
Publisher
IEEE Signal Processing Society
Place
Rhodes Island
DOI
10.1109/ICASSP49357.2023.10095168
BibTeX
@inproceedings{BUT187787,
  author="VILLATORO-TELLO, E. and MADIKERI, S. and ZULUAGA-GOMEZ, J. and SHARMA, B. and SARFJOO, S. and NIGMATULINA, I. and MOTLÍČEK, P. and IVANOV, V. and GANAPATHIRAJU, A.",
  title="Effectiveness of Text, Acoustic, and Lattice-Based Representations in Spoken Language Understanding Tasks",
  booktitle="ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
  year="2023",
  pages="1--5",
  publisher="IEEE Signal Processing Society",
  address="Rhodes Island",
  doi="10.1109/ICASSP49357.2023.10095168",
  isbn="978-1-7281-6327-7",
  url="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10095168"
}
Projects
Contemporary methods for processing, analysis and visualization of multimedia and 3D data, BUT, BUT internal projects, FIT-S-23-8278, start: 2023-03-01, end: 2026-02-28, running