Non-Parallel Voice Conversion

Czech title

Language

English

Abstract

Voice conversion (VC) aims at converting the voice of source speaker to the voice of target speaker. It is popular in funny Internet videos but has also series of serious use cases, such as dubbing of audiovisual material and anonymization of voice (for example for witness protection). As it can serve for spoofing of voice identification systems, it is also an important tool for development spoofing detectors and counter-measures. Training VC models has mainly been on parallel audios (ie. two speakers uttering the same text) and on high quality audio material. The goal of this thesis was to investigate developing VC on non-parallel data and with low quality signals, mainly from publicly available dataset VoxCeleb.

This work follows the state-of-the-art AutoVC architecture defined by Qian et al. It is based on neural network (NN) autoencoders, aiming to separate speech into content- and speaker-dependent embedding. The target speech is then obtained by replacing source speaker embedding by the target speaker one. We have improved Qian's architecture to process low-quality audio by experimenting with different speaker embeddings (d-vectors vs. x-vectors), introducing a speaker classifier from content embeddings in an adversarial setup, and tuning the size of content embeddings imposing an information bottleneck to the autoencoder. Also, we have defined another adversarial architecture by comparing original content embeddings with those obtained after the VC process. The results of experiments prove that non-parallel VC on low-quality data is indeed doable. The resulting audios were not so good as in case of using high-quality ones, but the speaker verification results after spoofing by proposed system have clearly shown a shift of voice characteristics toward the target speakers.

Keywords

voice conversion, speech processing, x-vector, d-vector, autoencoder, verification, spoofing, wavenet, neural networks

Department

Department of Computer Graphics and Multimedia FIT BUT

Degree Programme

Information Technology, Field of Study Computer Graphics and Multimedia

Files

Status

defended, grade A

Date

15 July 2020

Reviewer

Plchot Oldřich, Ing., Ph.D.

Committee

Černocký Jan, prof. Dr. Ing. (DCGM FIT BUT), předseda
Bařina David, Ing., Ph.D. (DCGM FIT BUT), člen
Beran Vítězslav, doc. Ing., Ph.D. (DCGM FIT BUT), člen
Grézl František, Ing., Ph.D. (DCGM FIT BUT), člen
Herout Adam, prof. Ing., Ph.D. (DCGM FIT BUT), člen
Křivka Zbyněk, Ing., Ph.D. (DIFS FIT BUT), člen

Citation

BRUKNER, Jan. Non-Parallel Voice Conversion. Brno, 2020. Master's Thesis. Brno University of Technology, Faculty of Information Technology. 2020-07-15. Supervised by Černocký Jan. Available from: https://www.fit.vut.cz/study/thesis/19207/

BibTeX

@mastersthesis{FITMT19207,
    author = "Jan Brukner",
    type = "Master's thesis",
    title = "Non-Parallel Voice Conversion",
    school = "Brno University of Technology, Faculty of Information Technology",
    year = 2020,
    location = "Brno, CZ",
    language = "english",
    url = "https://www.fit.vut.cz/study/thesis/19207/"
}