Result detail

Learning Speaker Representation for Neural Network Based Multichannel Speaker Extraction

ŽMOLÍKOVÁ, K.; DELCROIX, M.; KINOSHITA, K.; HIGUCHI, T.; OGAWA, A.; NAKATANI, T. Learning Speaker Representation for Neural Network Based Multichannel Speaker Extraction. In Proceedings of ASRU 2017. Okinawa: IEEE Signal Processing Society, 2017. p. 8-15. ISBN: 978-1-5090-4788-8.

Type

conference proceedings paper

Language

English

Authors

Žmolíková Kateřina, Ing., Ph.D., UPGM (FIT)
Delcroix Marc, FIT (FIT)
Kinoshita Keisuke
Higuchi Takuya
Ogawa Atsunori
Nakatani Tomohiro

Abstract

Recently, schemes employing deep neural networks (DNNs) for extracting speech from noisy observation have demonstrated great potential for noise robust automatic speech recognition. However, these schemes are not well suited when the interfering noise is another speaker. To enable extracting a target speaker from a mixture of speakers, we have recently proposed to inform the neural network using speaker information extracted from an adaptation utterance from the same speaker. In our previous work, we explored ways how to inform the network about the speaker and found a speaker adaptive layer approach to be suitable for this task. In our experiments, we used speaker features designed for speaker recognition tasks as the additional speaker information, which may not be optimal for the speaker extraction task. In this paper, we propose a usage of a sequence summarizing scheme enabling to learn the speaker representation jointly with the network. Furthermore, we extend the previous experiments to demonstrate the potential of our proposed method as a front-end for speech recognition and explore the effect of additional noise on the performance of the method.

Keywords

speaker extraction, speaker adaptive neural network, multi-speaker speech recognition, speaker representation learning, beamforming

URL

https://www.fit.vut.cz/research/group/speech/public/publi/2017/zmolikova…

Year

2017

Pages

8–15

Proceedings

Proceedings of ASRU 2017

Conference

2017 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU)

ISBN

978-1-5090-4788-8

Publisher

IEEE Signal Processing Society

Place

Okinawa

DOI

10.1109/ASRU.2017.8268910

UT WoS

000426066100002

EID Scopus

2-s2.0-85050535526

BibTeX

@inproceedings{BUT144503,
  author="Kateřina {Žmolíková} and Marc {Delcroix} and Keisuke {Kinoshita} and Takuya {Higuchi} and Atsunori {Ogawa} and Tomohiro {Nakatani}",
  title="Learning Speaker Representation for Neural Network Based Multichannel Speaker Extraction",
  booktitle="Proceedings of ASRU 2017",
  year="2017",
  pages="8--15",
  publisher="IEEE Signal Processing Society",
  address="Okinawa",
  doi="10.1109/ASRU.2017.8268910",
  isbn="978-1-5090-4788-8",
  url="http://www.fit.vutbr.cz/research/groups/speech/publi/2017/zmolikova_asru2017.pdf"
}

Projects

NTT - Parametrization with speech enhancement for robust automatic speech recognition with a large volume of training data, NTT, start: 2017-10-01, end: 2018-09-30, completed
Processing, visualization and analysis of multimedia and 3D data, BUT, BUT internal projects, FIT-S-17-3984, start: 2017-03-01, end: 2020-02-29, completed

Research groups

BUT Speech@FIT speech data mining research group (VZ SPEECH)

Department

Department of Computer Graphics and Multimedia (UPGM)