Result Details

Effectiveness of Text, Acoustic, and Lattice-Based Representations in Spoken Language Understanding Tasks

VILLATORO-TELLO, E.; MADIKERI, S.; ZULUAGA-GOMEZ, J.; SHARMA, B.; SARFJOO, S.; NIGMATULINA, I.; MOTLÍČEK, P.; IVANOV, V.; GANAPATHIRAJU, A. Effectiveness of Text, Acoustic, and Lattice-Based Representations in Spoken Language Understanding Tasks. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Rhodes Island: IEEE Signal Processing Society, 2023. p. 1-5. ISBN: 978-1-7281-6327-7.
Type
conference paper
Language
English
Authors
VILLATORO-TELLO, E.
MADIKERI, S. (FIT)
ZULUAGA-GOMEZ, J.
SHARMA, B.
SARFJOO, S.
NIGMATULINA, I.
MOTLÍČEK, P. (DCGM, FIT)
IVANOV, V.
GANAPATHIRAJU, A.
Abstract

In this paper, we perform an exhaustive evaluation of different representations to address the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems for the SLU intent detection task: 1) text-based, 2) lattice-based, and a novel 3) multimodal approach. Our work provides a comprehensive analysis of the achievable performance of different state-of-the-art SLU systems under different circumstances, e.g., automatically vs. manually generated transcripts. We evaluate the systems on the publicly available SLURP spoken language resource corpus. Our results indicate that using richer forms of Automatic Speech Recognition (ASR) output, namely word consensus networks, allows the SLU system to improve over the 1-best setup (5.5% relative improvement). Moreover, cross-modal approaches, i.e., learning from acoustic and text embeddings, obtain performance similar to the oracle setup, a relative improvement of 17.8% over the 1-best configuration, making them a recommended alternative for overcoming the limitations of working with automatically generated transcripts.
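
For illustration only (this sketch is not taken from the paper): a minimal PyTorch example of the general idea behind cross-modal attention fusion of text and acoustic embeddings for utterance-level intent classification. The class name, layer sizes, and the intent count of 60 are assumptions, and precomputed text/acoustic embeddings are stood in by random tensors.

# Minimal, illustrative sketch (not the authors' implementation) of
# cross-modal attention fusion of text and acoustic embeddings for
# intent classification; dimensions and layer choices are assumptions.
import torch
import torch.nn as nn

class CrossModalIntentClassifier(nn.Module):
    def __init__(self, text_dim=768, audio_dim=512, hidden=256, n_intents=60):
        super().__init__()
        # Project both modalities into a shared space.
        self.text_proj = nn.Linear(text_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        # Text tokens attend over acoustic frames (cross-modal attention).
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_intents)

    def forward(self, text_emb, audio_emb):
        # text_emb:  (batch, n_tokens, text_dim),  e.g. transformer outputs
        # audio_emb: (batch, n_frames, audio_dim), e.g. speech-encoder features
        q = self.text_proj(text_emb)
        kv = self.audio_proj(audio_emb)
        fused, _ = self.cross_attn(q, kv, kv)  # text queries, audio keys/values
        # Pool both streams and classify the utterance-level intent.
        pooled = torch.cat([q.mean(dim=1), fused.mean(dim=1)], dim=-1)
        return self.classifier(pooled)

# Usage with random tensors standing in for real features.
model = CrossModalIntentClassifier()
logits = model(torch.randn(2, 20, 768), torch.randn(2, 200, 512))
print(logits.shape)  # torch.Size([2, 60])

In practice, the text embeddings would come from a pretrained language model applied to ASR 1-best or word-consensus-network inputs and the acoustic embeddings from a pretrained speech encoder; the sketch only shows how the two streams could be combined.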

Keywords

Speech Recognition, Human-computer Interaction, Spoken Language Understanding, Word Consensus Networks, Cross-modal Attention

URL
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10095168
Published
2023
Pages
1–5
Proceedings
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Conference
2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
ISBN
978-1-7281-6327-7
Publisher
IEEE Signal Processing Society
Place
Rhodes Island
DOI
10.1109/ICASSP49357.2023.10095168
BibTeX
@inproceedings{BUT187787,
  author="VILLATORO-TELLO, E. and MADIKERI, S. and ZULUAGA-GOMEZ, J. and SHARMA, B. and SARFJOO, S. and NIGMATULINA, I. and MOTLÍČEK, P. and IVANOV, V. and GANAPATHIRAJU, A.",
  title="Effectiveness of Text, Acoustic, and Lattice-Based Representations in Spoken Language Understanding Tasks",
  booktitle="ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
  year="2023",
  pages="1--5",
  publisher="IEEE Signal Processing Society",
  address="Rhodes Island",
  doi="10.1109/ICASSP49357.2023.10095168",
  isbn="978-1-7281-6327-7",
  url="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10095168"
}
Projects
Contemporary methods for processing, analysis and visualization of multimedia and 3D data, BUT, BUT internal projects, FIT-S-23-8278, start: 2023-03-01, end: 2026-02-28, running