Thesis Details
Personal Voice Activity Detection
This work aims to implement, test, and evaluate a speaker-conditioned Voice Activity Detection (VAD) method called Personal VAD. The method builds upon an LSTM-based approach to VAD and its purpose is to introduce a system that can reliably detect speech of a target speaker, while retaining the typical characteristics of a VAD system, mainly in terms of small model size, low latency, and low necessary computational resources. The system is trained to distinguish between three classes: non-speech, target speaker speech, and non-target speaker speech. For this purpose, the method utilizes speaker embeddings as a part of the input feature vector to represent the target speaker. Some of the more heavyweight personal VAD variants also make use of speaker verification scores issued to each frame based on the target embedding, resulting in a more robust system. In addition to the one scoring method presented in the original article, two other scoring approaches are introduced, both outperforming the baseline method and improving the performance even for acoustically challenging conditions.
voice activity detection, speech detection, recurrent neural networks, long short-term memory, LSTM, speaker recognition, speaker embeddings, d-vector
Češka Milan, doc. RNDr., Ph.D. (DITS FIT BUT), člen
Jaroš Jiří, doc. Ing., Ph.D. (DCSY FIT BUT), člen
Orság Filip, Ing., Ph.D. (DITS FIT BUT), člen
Rychlý Marek, RNDr., Ph.D. (DIFS FIT BUT), člen
@bachelorsthesis{FITBT23426, author = "\v{S}imon Sedl\'{a}\v{c}ek", type = "Bachelor's thesis", title = "Personal Voice Activity Detection", school = "Brno University of Technology, Faculty of Information Technology", year = 2021, location = "Brno, CZ", language = "english", url = "https://www.fit.vut.cz/study/thesis/23426/" }