Neurální extrakce řeči cílového řečníka

English title

Neural target speech extraction

Language

Czech

Abstract

As speech processing technologies are increasingly applied in the real world, their robustness has become a very important issue. In particular, processing speech corrupted by interfering, overlapping speakers is one of today's challenging problems. Speech separation approaches tackle this problem by separating the mixed speech into the signals of the individual speakers. These methods have recently made great headway by leveraging progress in deep learning. In many applications, such as smartphones or digital home assistants, the goal is to enhance the speech signal of one speaker of interest while suppressing other speakers and noise. In our work, we formulate this problem as target speech extraction and propose to solve it directly, i.e., to use a neural network with the enrollment speech and the mixture as inputs and the extracted speech of the target speaker as the output. We discuss and experimentally show the benefits of this approach compared to conventional speech separation: there is no need to count the speakers in the mixture, and the output is more consistent for longer recordings. We explore different aspects of the neural target speech extraction pipeline, namely the speaker embeddings, the methods used to inform the neural network about the target speaker, the input and output domains, and the loss function. Furthermore, we demonstrate how to combine target speech extraction with multi-channel methods, such as neural mask-based beamforming and spatial clustering. Such combinations make use of both conventional statistical methods (for processing the spatial information) and the strong modeling power of neural networks. Finally, we apply target speech extraction as a pre-processing step for two downstream tasks: automatic speech recognition and clustering-based diarization. We investigate how to efficiently combine the front-end processing with the downstream systems, including joint optimization and training with a weakly supervised loss function based on speaker labels.

Keywords

target speech extraction, neural networks, multi-channel processing, multi-speaker automatic speech recognition, multi-speaker diarization

Department

Department of Computer Graphics and Multimedia FIT BUT

Degree Programme

Computer Science and Engineering, Field of Study Computer Science and Engineering

Status

defended

Date

23 June 2022

Citation

ŽMOLÍKOVÁ, Kateřina. Neurální extrakce řeči cílového řečníka. Brno, 2021. Ph.D. Thesis. Brno University of Technology, Faculty of Information Technology. 2022-06-23. Supervised by Černocký Jan. Available from: https://www.fit.vut.cz/study/phd-thesis/1009/

BibTeX

@phdthesis{FITPT1009,
    author = "Kate\v{r}ina \v{Z}mol\'{i}kov\'{a}",
    type = "Ph.D. thesis",
    title = "Neur\'{a}ln\'{i} extrakce \v{r}e\v{c}i c\'{i}lov\'{e}ho \v{r}e\v{c}n\'{i}ka",
    school = "Brno University of Technology, Faculty of Information Technology",
    year = 2022,
    location = "Brno, CZ",
    language = "czech",
    url = "https://www.fit.vut.cz/study/phd-thesis/1009/"
}