Publication Details

SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures

ŽMOLÍKOVÁ Kateřina, DELCROIX Marc, KINOSHITA Keisuke, OCHIAI Tsubasa, NAKATANI Tomohiro, BURGET Lukáš and ČERNOCKÝ Jan. SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures. IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 4, 2019, pp. 800-814. ISSN 1932-4553. Available from:
Czech title
Neuronová síť poučená o mluvčím pro extrakci cílového mluvčího ze směsi řečových signálů
journal article
Žmolíková Kateřina, Ing., Ph.D. (DCGM FIT BUT)
Delcroix Marc (NTT)
Kinoshita Keisuke (NTT)
Ochiai Tsubasa (NTT)
Nakatani Tomohiro (NTT)
Burget Lukáš, doc. Ing., Ph.D. (DCGM FIT BUT)
Černocký Jan, prof. Dr. Ing. (DCGM FIT BUT)

Speaker extraction, speaker-aware neural network, multi-speaker speech recognition.


The processing of speech corrupted by interfering overlapping speakers is one of the challenging problems for today's automatic speech recognition systems. Recently, approaches based on deep learning have made great progress toward solving this problem. Most of these approaches tackle the problem as speech separation, i.e., they blindly recover all the speakers from the mixture. In some scenarios, such as smart personal devices, we may, however, be interested in recovering one target speaker from a mixture. In this paper, we introduce SpeakerBeam, a method for extracting a target speaker from the mixture based on an adaptation utterance spoken by the target speaker. Formulating the problem as speaker extraction avoids certain issues such as label permutation and the need to determine the number of speakers in the mixture. With SpeakerBeam, we jointly learn to extract a representation from the adaptation utterance characterizing the target speaker and to use this representation to extract the speaker. We explore several ways to do this, mostly inspired by speaker adaptation in acoustic models for automatic speech recognition. We evaluate the performance on the widely used WSJ0-2mix and WSJ0-3mix datasets, and on versions of these datasets modified with more noise or more realistic overlapping patterns. We further analyze the learned behavior by exploring the speaker representations and assessing the effect of the length of the adaptation data. The results show the benefit of including speaker information in the processing and the effectiveness of the proposed method.

IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 4, ISSN 1932-4553
Institute of Electrical and Electronics Engineers
EID Scopus
   author = "Kate\v{r}ina \v{Z}mol\'{i}kov\'{a} and Marc Delcroix and Keisuke Kinoshita and Tsubasa Ochiai and Tomohiro Nakatani and Luk\'{a}\v{s} Burget and Jan \v{C}ernock\'{y}",
   title = "SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures",
   pages = "800--814",
   journal = "IEEE Journal of Selected Topics in Signal Processing",
   volume = 13,
   number = 4,
   year = 2019,
   ISSN = "1932-4553",
   doi = "10.1109/JSTSP.2019.2922820",
   language = "english",
   url = ""