Result detail

Single Channel Target Speaker Extraction and Recognition with Speaker Beam

DELCROIX, M.; ŽMOLÍKOVÁ, K.; KINOSHITA, K.; OGAWA, A.; NAKATANI, T. Single Channel Target Speaker Extraction and Recognition with Speaker Beam. In Proceedings of ICASSP 2018. Calgary: IEEE Signal Processing Society, 2018. p. 5554-5558. ISBN: 978-1-5386-4658-8.

Type

conference proceedings paper

Language

English

Authors

Delcroix Marc, FIT (FIT)
Žmolíková Kateřina, Ing., Ph.D., UPGM (FIT)
Kinoshita Keisuke, FIT (FIT)
Ogawa Atsunori, FIT (FIT)
Nakatani Tomohiro, FIT (FIT)

Abstract

This paper addresses the problem of single channel speech recognition of a target speaker in a mixture of speech signals. We propose to exploit auxiliary speaker information provided by an adaptation utterance from the target speaker to extract and recognize only that speaker. Using such auxiliary information, we can build a speaker extraction neural network (NN) that is independent of the number of sources in the mixture, and that can track speakers across different utterances, which are two challenging issues occurring with conventional approaches for speech recognition of mixtures. We call such an informed speaker extraction scheme "SpeakerBeam". SpeakerBeam exploits a recently developed context adaptive deep NN (CADNN) that allows tracking speech from a target speaker using a speaker adaptation layer, whose parameters are adjusted depending on auxiliary features representing the target speaker characteristics. SpeakerBeam was previously investigated for speaker extraction using a microphone array. In this paper, we demonstrate that it is also efficient for single channel speaker extraction. The speaker adaptation layer can be employed either to build a speaker adaptive acoustic model that recognizes only the target speaker or a mask-based speaker extraction network that extracts the target speech from the speech mixture signal prior to recognition. We also show that the latter speaker extraction network can be optimized jointly with an acoustic model to further improve ASR performance.

Keywords

Speech Recognition, Speech mixtures, Speaker extraction, Adaptation, Robust ASR

URL

https://www.fit.vut.cz/research/group/speech/public/publi/2018/delcroix… PDF

Year

2018

Pages

5554–5558

Proceedings

Proceedings of ICASSP 2018

Conference

IEEE International Conference on Acoustics, Speech and Signal Processing

ISBN

978-1-5386-4658-8

Publisher

IEEE Signal Processing Society

Place

Calgary

DOI

10.1109/ICASSP.2018.8462661

UT WoS

000446384605144

EID Scopus

2-s2.0-85054290595

BibTeX

@inproceedings{BUT155043,
  author="Marc {Delcroix} and Kateřina {Žmolíková} and Keisuke {Kinoshita} and Atsunori {Ogawa} and Tomohiro {Nakatani}",
  title="Single Channel Target Speaker Extraction and Recognition with Speaker Beam",
  booktitle="Proceedings of ICASSP 2018",
  year="2018",
  pages="5554--5558",
  publisher="IEEE Signal Processing Society",
  address="Calgary",
  doi="10.1109/ICASSP.2018.8462661",
  isbn="978-1-5386-4658-8",
  url="https://www.fit.vut.cz/research/publication/11721/"
}

Files

pdf delcroix_icassp2018_0005554.pdf 1 MB

Projects

IT4Innovations excellence in science, MŠMT, National Sustainability Programme II, LQ1602, start: 2016-01-01, end: 2020-12-31, completed
NTT - Parameterization with speech enhancement for robust automatic speech recognition with a large volume of training data, NTT, start: 2017-10-01, end: 2018-09-30, completed

Research groups

Speech Data Mining Research Group BUT Speech@FIT (VZ SPEECH)

Departments

Department of Computer Graphics and Multimedia (UPGM)