Result detail

Effectiveness of Text, Acoustic, and Lattice-Based Representations in Spoken Language Understanding Tasks

VILLATORO-TELLO, E.; MADIKERI, S.; ZULUAGA-GOMEZ, J.; SHARMA, B.; SARFJOO, S.; NIGMATULINA, I.; MOTLÍČEK, P.; IVANOV, V.; GANAPATHIRAJU, A. Effectiveness of Text, Acoustic, and Lattice-Based Representations in Spoken Language Understanding Tasks. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Rhodes Island: IEEE Signal Processing Society, 2023. p. 1-5. ISBN: 978-1-7281-6327-7.
Type
conference proceedings paper
Language
English
Authors
VILLATORO-TELLO, E.
Madikeri Srikanth, FIT (FIT)
ZULUAGA-GOMEZ, J.
SHARMA, B.
Sarfjoo Seyyed Saeed
NIGMATULINA, I.
Motlíček Petr, doc. Ing., Ph.D., UPGM (FIT)
IVANOV, V.
GANAPATHIRAJU, A.
Abstract

In this paper, we perform an exhaustive evaluation of different representations to address the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems to perform the SLU intent detection task: 1) text-based, 2) lattice-based, and 3) a novel multimodal approach. Our work provides a comprehensive analysis of the performance achievable by different state-of-the-art SLU systems under different circumstances, e.g., automatically vs. manually generated transcripts. We evaluate the systems on the publicly available SLURP spoken language resource corpus. Our results indicate that using richer forms of Automatic Speech Recognition (ASR) outputs, namely word consensus networks, allows the SLU system to improve in comparison to the 1-best setup (5.5% relative improvement). However, cross-modal approaches, i.e., learning from acoustic and text embeddings, obtain performance similar to the oracle setup, a relative improvement of 17.8% over the 1-best configuration, making them a recommended alternative to overcome the limitations of working with automatically generated transcripts.

Keywords

Speech Recognition, Human-computer Interaction, Spoken Language Understanding, Word Consensus Networks, Cross-modal Attention

URL
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10095168
Year
2023
Pages
1–5
Proceedings
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Conference
2023 IEEE International Conference on Acoustics, Speech and Signal Processing IEEE
ISBN
978-1-7281-6327-7
Publisher
IEEE Signal Processing Society
Place
Rhodes Island
DOI
10.1109/ICASSP49357.2023.10095168
BibTeX
@inproceedings{BUT187787,
  author="VILLATORO-TELLO, E. and MADIKERI, S. and ZULUAGA-GOMEZ, J. and SHARMA, B. and SARFJOO, S. and NIGMATULINA, I. and MOTLÍČEK, P. and IVANOV, V. and GANAPATHIRAJU, A.",
  title="Effectiveness of Text, Acoustic, and Lattice-Based Representations in Spoken Language Understanding Tasks",
  booktitle="ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
  year="2023",
  pages="1--5",
  publisher="IEEE Signal Processing Society",
  address="Rhodes Island",
  doi="10.1109/ICASSP49357.2023.10095168",
  isbn="978-1-7281-6327-7",
  url="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10095168"
}
Projects
Soudobé metody zpracování, analýzy a zobrazování multimediálních a 3D dat (Contemporary methods of processing, analysis, and visualization of multimedia and 3D data), VUT, VUT internal projects, FIT-S-23-8278, start: 2023-03-01, end: 2026-02-28, in progress