Exploiting Uncertainty Information in Speaker Verification and Diarization
This thesis considers two models allowing to utilize uncertainty information in the tasks of Automatic Speaker Verification and Speaker Diarization.The first model we consider is a modification of the widely-used Gaussian Probabilistic Linear Discriminant Analysis (G-PLDA) that models the distribution of the vector utterance representations called embeddings. In G-PLDA, the embeddings are assumed to be generated by adding a noise vector sampled from a Gaussian distribution to a speakerdependent vector. We show that when assuming that the noise was instead sampled from a Student's T-distribution, the PLDA model (we call this version heavy-tailed PLDA) can use the uncertainty information when making the verification decisions. Our model is conceptually similar to the HT-PLDA model defined by Kenny et al. in 2010, but, as we show in this thesis, it allows for fast scoring, while the original HT-PLDA definition requires considerable time and computation resources for scoring. We present the algorithm to train our version of HT-PLDA as a generative model. Also, we consider various strategies for discriminatively training the parameters of the model. We test the performance of generatively and discriminatively trained HT-PLDA on the speaker verification task. The results indicate that HT-PLDA performs on par with the standard G-PLDA while having the advantage of being more robust against variations in the data pre-processing. Experiments on the speaker diarization demonstrate that the HT-PLDA model not only provides better performance than the G-PLDA baseline model but also has the advantage of producing better-calibrated Log-Likelihood Ratio (LLR) scores.In the second model, unlike in HT-PLDA, we do not consider the embeddings as the observed data. Instead, in this model, the embeddings are normally distributed hidden variables. The embedding precision carries the information about the quality of the speech segment: for clean long segments, the precision should be high, and for short and noisy utterances, it should be low. We show how such probabilistic embeddings can be incorporated into the G-PLDA framework and how the parameters of the hidden embedding influence its impact when computing the likelihood with this model. In the experiments, we demonstrate how to utilize an existing neural network (NN) embedding extractor to provide not embeddings but parameters of probabilistic embedding distribution. We test the performance of the probabilistic embeddings model on the speaker diarization task. The results demonstrate that this model provides well-calibrated LLR scores allowing for better diarization when no development dataset is available to tune the clustering algorithm.
Speaker Verification, Speaker Diarization, Probabilistic Linear Discriminant Analysis, Uncertainty Propagation, Discriminative Training