- We present Deep Voice, a production-quality text-to-speech system constructed entirely from deep neural networks. Deep Voice lays the groundwork for truly end-to-end neural speech synthesis. The system comprises five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For the segmentation model, we propose a novel way of performing phoneme boundary detection with deep neural networks using connectionist temporal classification (CTC) loss. For the audio synthesis model, we implement a variant of WaveNet that requires fewer parameters and trains faster than the original. By using a neural network for each component, our system is simpler and more flexible than traditional text-to-speech systems, where each component requires laborious feature engineering and extensive domain expertise. Finally, we show that inference with our system can be performed faster than real time and describe optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.
- We present an approach to adaptively utilize deep neural networks in order to reduce the evaluation time on new examples without loss of classification performance. Rather than attempting to redesign or approximate existing networks, we propose two schemes that adaptively utilize networks. First, we pose an adaptive network evaluation scheme, where we learn a system to adaptively choose the components of a deep network to be evaluated for each example. By allowing examples correctly classified using early layers of the system to exit, we avoid the computational time associated with full evaluation of the network. Building upon this approach, we then learn a network selection system that adaptively selects the network to be evaluated for each example. We exploit the fact that many examples can be correctly classified using relatively efficient networks and that complex, computationally costly networks are only necessary for a small fraction of examples. By avoiding evaluation of these complex networks for a large fraction of examples, computational time can be dramatically reduced. Empirically, these approaches yield dramatic reductions in computational cost, with up to a 2.8x speedup on state-of-the-art networks from the ImageNet image recognition challenge with minimal (less than 1%) loss of accuracy.
- Feb 28 2017 cs.NE arXiv:1702.07805v1Recurrent neural networks (RNNs) have shown success for many sequence-modeling tasks, but learning long-term dependencies from data remains difficult. This is often attributed to the vanishing gradient problem, which shows that gradient components relating a loss at time $t$ to time $t - \tau$ tend to decay exponentially with $\tau$. Long short-term memory (LSTM) and gated recurrent units (GRUs), the most widely-used RNN architectures, attempt to remedy this problem by making the decay's base closer to 1. NARX RNNs take an orthogonal approach: by including direct connections, or delays, from the past, NARX RNNs make the decay's exponent closer to 0. However, as introduced, NARX RNNs reduce the decay's exponent only by a factor of $n_d$, the number of delays, and simultaneously increase computation by this same factor. We introduce a new variant of NARX RNNs, called MIxed hiSTory RNNs, which addresses these drawbacks. We show that for $\tau \leq 2^{n_d-1}$, MIST RNNs reduce the decay's worst-case exponent from $\tau / n_d$ to $\log \tau$, while maintaining computational complexity that is similar to LSTM and GRUs. We compare MIST RNNs to simple RNNs, LSTM, and GRUs across 4 diverse tasks. MIST RNNs outperform all other methods in 2 cases, and in all cases are competitive.
- This paper is a review of the evolutionary history of deep learning models. It covers from the genesis of neural networks when associationism modeling of the brain is studied, to the models that dominate the last decade of research in deep learning like convolutional neural networks, deep belief networks, and recurrent neural networks, and extends to popular recent models like variational autoencoder and generative adversarial nets. In addition to a review of these models, this paper primarily focuses on the precedents of the models above, examining how the initial ideas are assembled to construct the early models and how these preliminary models are developed into their current forms. Many of these evolutionary paths last more than half a century and have a diversity of directions. For example, CNN is built on prior knowledge of biological vision system; DBN is evolved from a trade-off of modeling power and computation complexity of graphical models and many nowadays models are neural counterparts of ancient linear models. This paper reviews these evolutionary paths and offers a concise thought flow of how these models are developed, and aims to provide a thorough background for deep learning. More importantly, along with the path, this paper summarizes the gist behind these milestones and proposes many directions to guide the future research of deep learning.
- Environmental audio tagging is a newly proposed task to predict the presence or absence of a specific audio event in a chunk. Deep neural network (DNN) based methods have been successfully adopted for predicting the audio tags in the domestic audio scene. In this paper, we propose to use a convolutional neural network (CNN) to extract robust features from mel-filter banks (MFBs), spectrograms or even raw waveforms for audio tagging. Gated recurrent unit (GRU) based recurrent neural networks (RNNs) are then cascaded to model the long-term temporal structure of the audio signal. To complement the input information, an auxiliary CNN is designed to learn on the spatial features of stereo recordings. We evaluate our proposed methods on Task 4 (audio tagging) of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. Compared with our recent DNN-based method, the proposed structure can reduce the equal error rate (EER) from 0.13 to 0.11 on the development set. The spatial features can further reduce the EER to 0.10. The performance of the end-to-end learning on raw waveforms is also comparable. Finally, on the evaluation set, we get the state-of-the-art performance with 0.12 EER while the performance of the best existing system is 0.15 EER.