- We study the least squares regression problem \beginalign* \min_\Theta ∈\mathcalS_⊙D,R \|A\Theta-b\|_2, \endalign* where $\mathcal{S}_{\odot D,R}$ is the set of $\Theta$ for which $\Theta = \sum_{r=1}^{R} \theta_1^{(r)} \circ \cdots \circ \theta_D^{(r)}$ for vectors $\theta_d^{(r)} \in \mathbb{R}^{p_d}$ for all $r \in [R]$ and $d \in [D]$, and $\circ$ denotes the outer product of vectors. That is, $\Theta$ is a low-dimensional, low-rank tensor. This is motivated by the fact that the number of parameters in $\Theta$ is only $R \cdot \sum_{d=1}^D p_d$, which is significantly smaller than the $\prod_{d=1}^{D} p_d$ number of parameters in ordinary least squares regression. We consider the above CP decomposition model of tensors $\Theta$, as well as the Tucker decomposition. For both models we show how to apply data dimensionality reduction techniques based on \it sparse random projections $\Phi \in \mathbb{R}^{m \times n}$, with $m \ll n$, to reduce the problem to a much smaller problem $\min_{\Theta} \|\Phi A \Theta - \Phi b\|_2$, for which if $\Theta'$ is a near-optimum to the smaller problem, then it is also a near optimum to the original problem. We obtain significantly smaller dimension and sparsity in $\Phi$ than is possible for ordinary least squares regression, and we also provide a number of numerical simulations supporting our theory.
- Black box variational inference (BBVI) with reparameterization gradients triggered the exploration of divergence measures other than the Kullback-Leibler (KL) divergence, such as alpha divergences. These divergences can be tuned to be more mass-covering (preventing overfitting in complex models), but are also often harder to optimize using Monte-Carlo gradients. In this paper, we view BBVI with generalized divergences as a form of biased importance sampling. The choice of divergence determines a bias-variance tradeoff between the tightness of the bound (low bias) and the variance of its gradient estimators. Drawing on variational perturbation theory of statistical physics, we use these insights to construct a new variational bound which is tighter than the KL bound and more mass covering. Compared to alpha-divergences, its reparameterization gradients have a lower variance. We show in several experiments on Gaussian Processes and Variational Autoencoders that the resulting posterior covariances are closer to the true posterior and lead to higher likelihoods on held-out data.
- We present an approach to automate the process of discovering optimization methods, with a focus on deep learning architectures. We train a Recurrent Neural Network controller to generate a string in a domain specific language that describes a mathematical update equation based on a list of primitive functions, such as the gradient, running average of the gradient, etc. The controller is trained with Reinforcement Learning to maximize the performance of a model after a few epochs. On CIFAR-10, our method discovers several update rules that are better than many commonly used optimizers, such as Adam, RMSProp, or SGD with and without Momentum on a ConvNet model. We introduce two new optimizers, named PowerSign and AddSign, which we show transfer well and improve training on a variety of different tasks and architectures, including ImageNet classification and Google's neural machine translation system.
- Generative Adversarial Networks (GANs) produce systematically better quality samples when class label information is provided., i.e. in the conditional GAN setup. This is still observed for the recently proposed Wasserstein GAN formulation which stabilized adversarial training and allows considering high capacity network architectures such as ResNet. In this work we show how to boost conditional GAN by augmenting available class labels. The new classes come from clustering in the representation space learned by the same GAN model. The proposed strategy is also feasible when no class information is available, i.e. in the unsupervised setup. Our generated samples reach state-of-the-art Inception scores for CIFAR-10 and STL-10 datasets in both supervised and unsupervised setup.
- In recent years, the number of papers on Alzheimer's disease classification has increased dramatically, generating interesting methodological ideas on the use machine learning and feature extraction methods. However, practical impact is much more limited and, eventually, one could not tell which of these approaches are the most efficient. While over 90\% of these works make use of ADNI an objective comparison between approaches is impossible due to variations in the subjects included, image pre-processing, performance metrics and cross-validation procedures. In this paper, we propose a framework for reproducible classification experiments using multimodal MRI and PET data from ADNI. The core components are: 1) code to automatically convert the full ADNI database into BIDS format; 2) a modular architecture based on Nipype in order to easily plug-in different classification and feature extraction tools; 3) feature extraction pipelines for MRI and PET data; 4) baseline classification approaches for unimodal and multimodal features. This provides a flexible framework for benchmarking different feature extraction and classification tools in a reproducible manner. We demonstrate its use on all (1519) baseline T1 MR images and all (1102) baseline FDG PET images from ADNI 1, GO and 2 with SPM-based feature extraction pipelines and three different classification techniques (linear SVM, anatomically regularized SVM and multiple kernel learning SVM). The highest accuracies achieved were: 91% for AD vs CN, 83% for MCIc vs CN, 75% for MCIc vs MCInc, 94% for AD-A$\beta$+ vs CN-A$\beta$- and 72% for MCIc-A$\beta$+ vs MCInc-A$\beta$+. The code is publicly available at https://gitlab.icm-institute.org/aramislab/AD-ML (depends on the Clinica software platform, publicly available at http://www.clinica.run).
- Swarm systems constitute a challenging problem for reinforcement learning (RL) as the algorithm needs to learn decentralized control policies that can cope with limited local sensing and communication abilities of the agents. Although there have been recent advances of deep RL algorithms applied to multi-agent systems, learning communication protocols while simultaneously learning the behavior of the agents is still beyond the reach of deep RL algorithms. However, while it is often difficult to directly define the behavior of the agents, simple communication protocols can be defined more easily using prior knowledge about the given task. In this paper, we propose a number of simple communication protocols that can be exploited by deep reinforcement learning to find decentralized control policies in a multi-robot swarm environment. The protocols are based on histograms that encode the local neighborhood relations of the agents and can also transmit task-specific information, such as the shortest distance and direction to a desired target. In our framework, we use an adaptation of Trust Region Policy Optimization to learn complex collaborative tasks, such as formation building, building a communication link, and pushing an intruder. We evaluate our findings in a simulated 2D-physics environment, and compare the implications of different communication protocols.
- This paper studies how to efficiently learn an optimal latent variable model online from large streaming data. Latent variable models can explain the observed data in terms of unobserved concepts. They are traditionally studied in the unsupervised learning setting, and learned by iterative methods such as the EM. Very few online learning algorithms for latent variable models have been developed, and the most popular one is online EM. Though online EM is computationally efficient, it typically converges to a local optimum. In this work, we motivate and develop SpectralFPL, a novel algorithm to learn latent variable models online from streaming data. SpectralFPL is computationally efficient, and we prove that it quickly learns the global optimum under a bag-of-words model by deriving an $O(\sqrt n)$ regret bound. Experiment results also demonstrate a consistent performance improvement of SpectralFPL over online EM: in both synthetic and real-world experiments, SpectralFPL's performance is similar with or even better than online EM with optimally tuned parameters.
- In this paper, we propose a novel recurrent neural network architecture for speech separation. This architecture is constructed by unfolding the iterations of a sequential iterative soft-thresholding algorithm (ISTA) that solves the optimization problem for sparse nonnegative matrix factorization (NMF) of spectrograms. We name this network architecture deep recurrent NMF (DR-NMF). The proposed DR-NMF network has three distinct advantages. First, DR-NMF provides better interpretability than other deep architectures, since the weights correspond to NMF model parameters, even after training. This interpretability also provides principled initializations that enable faster training and convergence to better solutions compared to conventional random initialization. Second, like many deep networks, DR-NMF is an order of magnitude faster at test time than NMF, since computation of the network output only requires evaluating a few layers at each time step. Third, when a limited amount of training data is available, DR-NMF exhibits stronger generalization and separation performance compared to sparse NMF and state-of-the-art long-short term memory (LSTM) networks. When a large amount of training data is available, DR-NMF achieves lower yet competitive separation performance compared to LSTM networks.
- A latent-variable model is introduced for text matching, inferring sentence representations by jointly optimizing generative and discriminative objectives. To alleviate typical optimization challenges in latent-variable models for text, we employ deconvolutional networks as the sequence decoder (generator), providing learned latent codes with more semantic information and better generalization. Our model, trained in an unsupervised manner, yields stronger empirical predictive performance than a decoder based on Long Short-Term Memory (LSTM), with less parameters and considerably faster training. Further, we apply it to text sequence-matching problems. The proposed model significantly outperforms several strong sentence-encoding baselines, especially in the semi-supervised setting.
- Topological Data Analysis (TDA) is a recent and growing branch of statistics devoted to the study of the shape of the data. In this work we investigate the predictive power of TDA in the context of supervised learning. Since topological summaries, most noticeably the Persistence Diagram, are typically defined in complex spaces, we adopt a kernel approach to translate them into more familiar vector spaces. We define a topological exponential kernel, we characterize it, and we show that, despite not being positive semi-definite, it can be successfully used in regression and classification tasks.
- Although deep Convolutional Neural Network (CNN) has shown better performance in various machine learning tasks, its application is accompanied by a significant increase in storage and computation. Among CNN simplification techniques, parameter pruning is a promising approach which aims at reducing the number of weights of various layers without intensively reducing the original accuracy. In this paper, we propose a novel progressive parameter pruning method, named Structured Probabilistic Pruning (SPP), which efficiently prunes weights of convolutional layers in a probabilistic manner. Unlike existing deterministic pruning approaches, in which the pruned weights of a well-trained model are permanently eliminated, SPP utilizes the relative importance of weights during training iterations, which makes the pruning procedure more accurate by leveraging the accumulated weight importance. Specifically, we introduce an effective weight competition mechanism to emphasize the important weights and gradually undermine the unimportant ones. Experiments indicate that our proposed method has obtained superior performance on ConvNet and AlexNet compared with existing pruning methods. Our pruned AlexNet achieves 4.0 $\sim$ 8.9x (averagely 5.8x) layer-wise speedup in convolutional layers with only 1.3\% top-5 error increase on the ImageNet-2012 validation dataset. We also prove the effectiveness of our method on transfer learning scenarios using AlexNet.
- Sep 22 2017 stat.ML arXiv:1709.07175v1