In this paper, we explicitly extract and jointly model multi-view information, such as speaker identity and text content, from short utterances. During the development stage, a deep neural network (DNN) used to extract the j-vector is initialized and trained with speech frames as input and the actual side information of the utterance as flat, block-wise one-hot output labels. In text-dependent speaker verification, since there is no one-to-one mapping between input frames and text content labels, a syllable-aware DNN is trained to provide a compact lexical representation of the utterance, the s-vector. These two vectors (j-vector and s-vector) are combined into a multi-view vector representation of the utterance during the enrollment and evaluation stages. To better describe such multi-view vectors, we propose a multi-view probabilistic linear discriminant analysis (PLDA) model that incorporates both within-speaker/text and between-speaker/text variation. In verification, we calculate the likelihood that two multi-view vectors belong to the same speaker and text or not, and this likelihood is used for decision-making. Large-scale experiments for the open-set condition showed that our approach achieves 0.26\% EER, 2.7\% EER, and 1.34\% EER for the impostor wrong, impostor correct, and target wrong conditions, respectively.
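As an illustration of the flat, block-wise one-hot labels mentioned above, the following minimal sketch builds a training target by concatenating one one-hot block per type of side information. The abstract does not specify the exact layout, so the concatenation order, the function name `blockwise_one_hot`, and the example sizes are assumptions made for illustration only.

```python
def blockwise_one_hot(speaker_id, text_id, num_speakers, num_texts):
    """Build a flat label: a one-hot speaker block followed by a one-hot text block.

    Assumption: the speaker block comes first; the source does not state the order.
    """
    if not 0 <= speaker_id < num_speakers:
        raise ValueError("speaker_id out of range")
    if not 0 <= text_id < num_texts:
        raise ValueError("text_id out of range")

    speaker_block = [0] * num_speakers
    speaker_block[speaker_id] = 1

    text_block = [0] * num_texts
    text_block[text_id] = 1

    # Flat concatenation: exactly one active entry per block.
    return speaker_block + text_block


# Hypothetical sizes: 4 speakers and 3 phrases in the development set.
label = blockwise_one_hot(speaker_id=2, text_id=0, num_speakers=4, num_texts=3)
```

Each frame of an utterance would share the same label under this reading, since the side information is defined per utterance rather than per frame.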
Submitted 20 Apr 2017 to Learning
Published 21 Apr 2017
Updated 21 Apr 2017