0

Latent space representation for multi-target speaker detection and identification with a sparse dataset using Triplet neural networks

Triplet Neural Networks (TNNs) improve speaker recognition performance, especially with small datasets, outperforming the traditional $i$-vector with PLDA approach.

Year
2019
Venue
arXiv 2019
Authors
4
Hosting
Abstract onlyARXIV-DEFAULT

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/1910.01463v2ARXIV-DEFAULT
TL;DR
Semantic Scholar
Attribution policy →

Abstract

We present an approach to tackle the speaker recognition problem using Triplet Neural Networks. Currently, the $i$-vector representation with probabilistic linear discriminant analysis (PLDA) is the most commonly used technique to solve this problem, due to high classification accuracy with a relatively short computation time. In this paper, we explore a neural network approach, namely Triplet Neural Networks (TNNs), to built a latent space for different classifiers to solve the Multi-Target Speaker Detection and Identification Challenge Evaluation 2018 (MCE 2018) dataset. This training set contains $i$-vectors from 3,631 speakers, with only 3 samples for each speaker, thus making speaker recognition a challenging task. When using the train and development set for training both the TNN and baseline model (i.e., similarity evaluation directly on the $i$-vector representation), our proposed model outperforms the baseline by 23%. When reducing the training data to only using the train set, our method results in 309 confusions for the Multi-target speaker identification task, which is 46% better than the baseline model. These results show that the representational power of TNNs is especially evident when training on small datasets with few instances available per class.

Authors

4