Despite the significant progress made in recent years in the dictation of single-talker speech, progress in speaker-independent multi-talker mixed-speech separation and tracing, often referred to as the cocktail-party problem, has been less impressive. In this paper we propose a novel technique for attacking this problem. The core of our technique is permutation invariant training (PIT), which aims to minimize the source stream reconstruction error regardless of how the labels are ordered. This is achieved by automatically aligning labels to the output streams during training. This strategy effectively solves the label permutation problem observed in deep-learning-based techniques for speech separation. More interestingly, our approach can integrate speaker tracing into the PIT framework so that separation and tracing can be carried out in one step and trained end-to-end. This is achieved using recurrent neural networks (RNNs) by forcing separated frames belonging to the same speaker to be aligned to the same output layer during training. Furthermore, the computational cost introduced by PIT is very small compared to the RNN computation during training and is zero during separation. We evaluated PIT on the WSJ0 and Danish two- and three-talker mixed-speech separation tasks and found that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), deep clustering (DPCL), and deep attractor network (DANet), and that it generalizes well to unseen speakers and languages.
Submitted 18 Mar 2017 to Sound
Published 21 Mar 2017
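
The label alignment described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a plain mean squared error over arrays shaped (sources, frames, frequency bins), and the function name `pit_mse` and its arguments are hypothetical. A single permutation is chosen for the whole utterance, which mirrors the idea of keeping each speaker on the same output stream.

```python
from itertools import permutations

import numpy as np


def pit_mse(estimates, targets):
    """Return (loss, perm) for the best output-to-reference assignment.

    estimates, targets: arrays of shape (n_sources, n_frames, n_bins).
    perm[i] is the reference source assigned to output stream i.
    Assumption: plain MSE is used as the reconstruction error.
    """
    estimates = np.asarray(estimates, dtype=float)
    targets = np.asarray(targets, dtype=float)
    n_src = targets.shape[0]

    # pairwise[i, j] = MSE between output stream i and reference source j,
    # averaged over the whole utterance.
    pairwise = np.array(
        [[np.mean((estimates[i] - targets[j]) ** 2) for j in range(n_src)]
         for i in range(n_src)]
    )

    best_loss, best_perm = np.inf, None
    # Try every label ordering and keep the one with the lowest error.
    for perm in permutations(range(n_src)):
        loss = np.mean([pairwise[i, perm[i]] for i in range(n_src)])
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    refs = rng.normal(size=(2, 5, 3))
    # Outputs carry the right content but in swapped order.
    loss, perm = pit_mse(refs[::-1], refs)
    print(loss, perm)
```

The search visits S! orderings for S sources, which is 2 for two talkers and 6 for three. At separation time no reference labels exist, so no search is performed.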