Result details

Effectiveness of Text, Acoustic, and Lattice-Based Representations in Spoken Language Understanding Tasks

VILLATORO-TELLO, E.; MADIKERI, S.; ZULUAGA-GOMEZ, J.; SHARMA, B.; SARFJOO, S.; NIGMATULINA, I.; MOTLÍČEK, P.; IVANOV, V.; GANAPATHIRAJU, A. Effectiveness of Text, Acoustic, and Lattice-Based Representations in Spoken Language Understanding Tasks. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Rhodes Island: IEEE Signal Processing Society, 2023. p. 1-5. ISBN: 978-1-7281-6327-7.

Type

conference proceedings paper

Language

English

Authors

VILLATORO-TELLO, E.
Madikeri Srikanth, FIT (FIT)
ZULUAGA-GOMEZ, J.
SHARMA, B.
Sarfjoo Seyyed Saeed
NIGMATULINA, I.
Motlíček Petr, doc. Ing., Ph.D., UPGM (FIT)
IVANOV, V.
GANAPATHIRAJU, A.

Abstract

In this paper, we perform an exhaustive evaluation of different
representations to address the intent classification problem in a
Spoken Language Understanding (SLU) setup. We benchmark
three types of systems to perform the SLU intent detection task: 1)
text-based, 2) lattice-based, and a novel 3) multimodal approach.
Our work provides a comprehensive analysis of what could be the
achievable performance of different state-of-the-art SLU systems
under different circumstances, e.g., automatically- vs. manually-generated
transcripts. We evaluate the systems on the publicly
available SLURP spoken language resource corpus. Our results
indicate that using richer forms of Automatic Speech Recognition
(ASR) outputs, namely word-consensus-networks, allows the SLU
system to improve in comparison to the 1-best setup (5.5% relative
improvement). However, crossmodal approaches, i.e., learning
from acoustic and text embeddings, obtain performance similar to
the oracle setup, a relative improvement of 17.8% over the 1-best
configuration, being a recommended alternative to overcome the
limitations of working with automatically generated transcripts.

Keywords

Speech Recognition, Human-computer Interaction, Spoken Language Understanding, Word Consensus Networks, Cross-modal Attention

URL

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10095168

Year

2023

Pages

1–5

Proceedings

ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

Conference

2023 IEEE International Conference on Acoustics, Speech and Signal Processing IEEE

ISBN

978-1-7281-6327-7

Publisher

IEEE Signal Processing Society

Place

Rhodes Island

DOI

10.1109/ICASSP49357.2023.10095168

EID Scopus

2-s2.0-85177587537

BibTeX

@inproceedings{BUT187787,
  author="VILLATORO-TELLO, E. and MADIKERI, S. and ZULUAGA-GOMEZ, J. and SHARMA, B. and SARFJOO, S. and NIGMATULINA, I. and MOTLÍČEK, P. and IVANOV, V. and GANAPATHIRAJU, A.",
  title="Effectiveness of Text, Acoustic, and Lattice-Based Representations in Spoken Language Understanding Tasks",
  booktitle="ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
  year="2023",
  pages="1--5",
  publisher="IEEE Signal Processing Society",
  address="Rhodes Island",
  doi="10.1109/ICASSP49357.2023.10095168",
  isbn="978-1-7281-6327-7",
  url="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10095168"
}

Files

pdf villatoro-tello_icassp2023_10095168.pdf 976 kB

Projects

Contemporary Methods for Processing, Analysis and Visualization of Multimedia and 3D Data, Brno University of Technology (VUT), VUT internal projects, FIT-S-23-8278, start: 2023-03-01, end: 2026-02-28, in progress

Research groups

Speech Data Mining Research Group BUT Speech@FIT (VZ SPEECH)

Department

Department of Computer Graphics and Multimedia (UPGM)