# Neural and Evolutionary Computing (cs.NE)

• We present Deep Voice, a production-quality text-to-speech system constructed entirely from deep neural networks. Deep Voice lays the groundwork for truly end-to-end neural speech synthesis. The system comprises five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For the segmentation model, we propose a novel way of performing phoneme boundary detection with deep neural networks using connectionist temporal classification (CTC) loss. For the audio synthesis model, we implement a variant of WaveNet that requires fewer parameters and trains faster than the original. By using a neural network for each component, our system is simpler and more flexible than traditional text-to-speech systems, where each component requires laborious feature engineering and extensive domain expertise. Finally, we show that inference with our system can be performed faster than real time and describe optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.
• We present an approach to adaptively utilize deep neural networks in order to reduce the evaluation time on new examples without loss of classification performance. Rather than attempting to redesign or approximate existing networks, we propose two schemes that adaptively utilize networks. First, we pose an adaptive network evaluation scheme, where we learn a system to adaptively choose the components of a deep network to be evaluated for each example. By allowing examples correctly classified using early layers of the system to exit, we avoid the computational time associated with full evaluation of the network. Building upon this approach, we then learn a network selection system that adaptively selects the network to be evaluated for each example. We exploit the fact that many examples can be correctly classified using relatively efficient networks and that complex, computationally costly networks are only necessary for a small fraction of examples. By avoiding evaluation of these complex networks for a large fraction of examples, computational time can be dramatically reduced. Empirically, these approaches yield dramatic reductions in computational cost, with up to a 2.8x speedup on state-of-the-art networks from the ImageNet image recognition challenge with minimal (less than 1%) loss of accuracy.
• Recurrent neural networks (RNNs) have shown success for many sequence-modeling tasks, but learning long-term dependencies from data remains difficult. This is often attributed to the vanishing gradient problem, which shows that gradient components relating a loss at time $t$ to time $t - \tau$ tend to decay exponentially with $\tau$. Long short-term memory (LSTM) and gated recurrent units (GRUs), the most widely-used RNN architectures, attempt to remedy this problem by making the decay's base closer to 1. NARX RNNs take an orthogonal approach: by including direct connections, or delays, from the past, NARX RNNs make the decay's exponent closer to 0. However, as introduced, NARX RNNs reduce the decay's exponent only by a factor of $n_d$, the number of delays, and simultaneously increase computation by this same factor. We introduce a new variant of NARX RNNs, called MIxed hiSTory RNNs, which addresses these drawbacks. We show that for $\tau \leq 2^{n_d-1}$, MIST RNNs reduce the decay's worst-case exponent from $\tau / n_d$ to $\log \tau$, while maintaining computational complexity that is similar to LSTM and GRUs. We compare MIST RNNs to simple RNNs, LSTM, and GRUs across 4 diverse tasks. MIST RNNs outperform all other methods in 2 cases, and in all cases are competitive.
• Feb 28 2017 cs.LG cs.NE stat.ML arXiv:1702.07800v1
This paper is a review of the evolutionary history of deep learning models. It covers from the genesis of neural networks when associationism modeling of the brain is studied, to the models that dominate the last decade of research in deep learning like convolutional neural networks, deep belief networks, and recurrent neural networks, and extends to popular recent models like variational autoencoder and generative adversarial nets. In addition to a review of these models, this paper primarily focuses on the precedents of the models above, examining how the initial ideas are assembled to construct the early models and how these preliminary models are developed into their current forms. Many of these evolutionary paths last more than half a century and have a diversity of directions. For example, CNN is built on prior knowledge of biological vision system; DBN is evolved from a trade-off of modeling power and computation complexity of graphical models and many nowadays models are neural counterparts of ancient linear models. This paper reviews these evolutionary paths and offers a concise thought flow of how these models are developed, and aims to provide a thorough background for deep learning. More importantly, along with the path, this paper summarizes the gist behind these milestones and proposes many directions to guide the future research of deep learning.
• Environmental audio tagging is a newly proposed task to predict the presence or absence of a specific audio event in a chunk. Deep neural network (DNN) based methods have been successfully adopted for predicting the audio tags in the domestic audio scene. In this paper, we propose to use a convolutional neural network (CNN) to extract robust features from mel-filter banks (MFBs), spectrograms or even raw waveforms for audio tagging. Gated recurrent unit (GRU) based recurrent neural networks (RNNs) are then cascaded to model the long-term temporal structure of the audio signal. To complement the input information, an auxiliary CNN is designed to learn on the spatial features of stereo recordings. We evaluate our proposed methods on Task 4 (audio tagging) of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. Compared with our recent DNN-based method, the proposed structure can reduce the equal error rate (EER) from 0.13 to 0.11 on the development set. The spatial features can further reduce the EER to 0.10. The performance of the end-to-end learning on raw waveforms is also comparable. Finally, on the evaluation set, we get the state-of-the-art performance with 0.12 EER while the performance of the best existing system is 0.15 EER.
• We propose to study equivariance in deep neural networks through parameter symmetries. In particular, given a group G that acts discretely on the input and output of a standard neural network layer $\phi_W$, we show that equivariance of $\phi_W$ is linked to the symmetry group of network parameters W. We then propose a sparse parameter-sharing scheme to induce the desirable symmetry on W. Under some conditions on the action of G, our procedure for tying the parameters achieves G-equivariance and guarantee sensitivity to all other permutation groups outside G. We demonstrate the relation of our approach to recently-proposed "structured" neural layers such as group-convolution and graph-convolution which leads to new insights and improvement of these operations.
• Artificial neural networks can be trained with relatively low-precision floating-point and fixed-point arithmetic, using between one and 16 bits. Previous works have focused on relatively wide-but-shallow, feed-forward networks. We introduce a quantization scheme that is compatible with training very deep neural networks. Quantizing the network activations in the middle of each batch-normalization module can greatly reduce the amount of memory and computational power needed, with little loss in accuracy.
• Recent work on generative modeling of text has found that variational auto-encoders (VAE) incorporating LSTM decoders perform worse than simpler LSTM language models (Bowman et al., 2015). This negative result is so far poorly understood, but has been attributed to the propensity of LSTM decoders to ignore conditioning information from the encoder. In this paper, we experiment with a new type of decoder for VAE: a dilated CNN. By changing the decoder's dilation architecture, we control the effective context from previously generated words. In experiments, we find that there is a trade off between the contextual capacity of the decoder and the amount of encoding information used. We show that with the right decoder, VAE can outperform LSTM language models. We demonstrate perplexity gains on two datasets, representing the first positive experimental result on the use VAE for generative modeling of text. Further, we conduct an in-depth investigation of the use of VAE (with our new decoding architecture) for semi-supervised and unsupervised labeling tasks, demonstrating gains over several strong baselines.