- Quantum mechanics fundamentally forbids deterministic discrimination of quantum states and processes. However, the ability to optimally distinguish various classes of quantum data is an important primitive in quantum information science. In this work, we train near-term quantum circuits to classify data represented by non-orthogonal quantum probability distributions using the Adam stochastic optimization algorithm. This is achieved by iterative interactions of a classical device with a quantum processor to discover the parameters of an unknown non-unitary quantum circuit. This circuit learns to simulates the unknown structure of a generalized quantum measurement, or Positive-Operator-Value-Measure (POVM), that is required to optimally distinguish possible distributions of quantum inputs. Notably we use universal circuit topologies, with a theoretically motivated circuit design, which guarantees that our circuits can in principle learn to perform arbitrary input-output mappings. Our numerical simulations show that shallow quantum circuits could be trained to discriminate among various pure and mixed quantum states exhibiting a trade-off between minimizing erroneous and inconclusive outcomes with comparable performance to theoretically optimal POVMs. We train the circuit on different classes of quantum data and evaluate the generalization error on unseen mixed quantum states. This generalization power hence distinguishes our work from standard circuit optimization and provides an example of quantum machine learning for a task that has inherently no classical analogue.
- One of the main difficulties in analyzing neural networks is the non-convexity of the loss function which may have many bad local minima. In this paper, we study the landscape of neural networks for binary classification tasks. Under mild assumptions, we prove that after adding one special neuron with a skip connection to the output, or one special neuron per layer, every local minimum is a global minimum.
- We develop the first Bayesian Optimization algorithm, BLOSSOM, which selects between multiple alternative acquisition functions and traditional local optimization at each step. This is combined with a novel stopping condition based on expected regret. This pairing allows us to obtain the best characteristics of both local and Bayesian optimization, making efficient use of function evaluations while yielding superior convergence to the global minimum on a selection of optimization problems, and also halting optimization once a principled and intuitive stopping condition has been fulfilled.
- Feature hashing, also known as \em the hashing trick, introduced by Weinberger et al. (2009), is one of the key techniques used in scaling-up machine learning algorithms. Loosely speaking, feature hashing uses a random sparse projection matrix $A : \mathbb{R}^n \to \mathbb{R}^m$ (where $m \ll n$) in order to reduce the dimension of the data from $n$ to $m$ while approximately preserving the Euclidean norm. Every column of $A$ contains exactly one non-zero entry, equals to either $-1$ or $1$. Weinberger et al. showed tail bounds on $\|Ax\|_2^2$. Specifically they showed that for every $\varepsilon, \delta$, if $\|x\|_{\infty} / \|x\|_2$ is sufficiently small, and $m$ is sufficiently large, then $$\Pr[ \; | \;\|Ax\|_2^2 - \|x\|_2^2\; | < \varepsilon \|x\|_2^2 \;] \ge 1 - \delta \;.$$ These bounds were later extended by Dasgupta \etal (2010) and most recently refined by Dahlgaard et al. (2017), however, the true nature of the performance of this key technique, and specifically the correct tradeoff between the pivotal parameters $\|x\|_{\infty} / \|x\|_2, m, \varepsilon, \delta$ remained an open question. We settle this question by giving tight asymptotic bounds on the exact tradeoff between the central parameters, thus providing a complete understanding of the performance of feature hashing. We complement the asymptotic bound with empirical data, which shows that the constants "hiding" in the asymptotic notation are, in fact, very close to $1$, thus further illustrating the tightness of the presented bounds in practice.
- Although interactive learning puts the user into the loop, the learner remains mostly a black box for the user. Understanding the reasons behind queries and predictions is important when assessing how the learner works and, in turn, trust. Consequently, we propose the novel framework of explanatory interactive learning: in each step, the learner explains its interactive query to the user, and she queries of any active classifier for visualizing explanations of the corresponding predictions. We demonstrate that this can boost the predictive and explanatory powers of and the trust into the learned model, using text (e.g. SVMs) and image classification (e.g. neural networks) experiments as well as a user study.
- We study a recent model of collaborative PAC learning where $k$ players with $k$ different tasks collaborate to learn a single classifier that works for all tasks. Previous work showed that when there is a classifier that has very small error on all tasks, there is a collaborative algorithm that finds a single classifier for all tasks and it uses $O(\ln^2 (k))$ times the sample complexity to learn a single task. In this work, we design new algorithms for both the realizable and the non-realizable settings using only $O(\ln (k))$ times the sample complexity to learn a single task. The sample complexity upper bounds of our algorithms match previous lower bounds and in some range of parameters are even better than previous algorithms that are allowed to output different classifiers for different tasks.
- Currently, progressively larger deep neural networks are trained on ever growing data corpora. As this trend is only going to increase in the future, distributed training schemes are becoming increasingly relevant. A major issue in distributed training is the limited communication bandwidth between contributing nodes or prohibitive communication cost in general. These challenges become even more pressing, as the number of computation nodes increases. To counteract this development we propose sparse binary compression (SBC), a compression framework that allows for a drastic reduction of communication cost for distributed training. SBC combines existing techniques of communication delay and gradient sparsification with a novel binarization method and optimal weight update encoding to push compression gains to new limits. By doing so, our method also allows us to smoothly trade-off gradient sparsity and temporal sparsity to adapt to the requirements of the learning task. Our experiments show, that SBC can reduce the upstream communication on a variety of convolutional and recurrent neural network architectures by more than four orders of magnitude without significantly harming the convergence speed in terms of forward-backward passes. For instance, we can train ResNet50 on ImageNet in the same number of iterations to the baseline accuracy, using $\times 3531$ less bits or train it to a $1\%$ lower accuracy using $\times 37208$ less bits. In the latter case, the total upstream communication required is cut from 125 terabytes to 3.35 gigabytes for every participating client.
- In recent years, due to the booming development of online social networks, fake news for various commercial and political purposes has been appearing in large numbers and widespread in the online world. With deceptive words, online social network users can get infected by these online fake news easily, which has brought about tremendous effects on the offline society already. An important goal in improving the trustworthiness of information in online social networks is to identify the fake news timely. This paper aims at investigating the principles, methodologies and algorithms for detecting fake news articles, creators and subjects from online social networks and evaluating the corresponding performance. This paper addresses the challenges introduced by the unknown characteristics of fake news and diverse connections among news articles, creators and subjects. Based on a detailed data analysis, this paper introduces a novel automatic fake news credibility inference model, namely FakeDetector. Based on a set of explicit and latent features extracted from the textual information, FakeDetector builds a deep diffusive network model to learn the representations of news articles, creators and subjects simultaneously. Extensive experiments have been done on a real-world fake news dataset to compare FakeDetector with several state-of-the-art models, and the experimental results have demonstrated the effectiveness of the proposed model.
- Programming has been an important skill for researchers and practitioners in computer science and other related areas. To learn basic programing skills, a long-time systematic training is usually required for beginners. According to a recent market report, the computer software market is expected to continue expanding at an accelerating speed, but the market supply of qualified software developers can hardly meet such a huge demand. In recent years, the surge of text generation research works provides the opportunities to address such a dilemma through automatic program synthesis. In this paper, we propose to make our try to solve the program synthesis problem from a data mining perspective. To address the problem, a novel generative model, namely EgoCoder, will be introduced in this paper. EgoCoder effectively parses program code into abstract syntax trees (ASTs), where the tree nodes will contain the program code/comment content and the tree structure can capture the program logic flows. Based on a new unit model called Hsu, EgoCoder can effectively capture both the hierarchical and sequential patterns in the program ASTs. Extensive experiments will be done to compare EgoCoder with the state-of-the-art text generation methods, and the experimental results have demonstrated the effectiveness of EgoCoder in addressing the program synthesis problem.
- We propose a novel data-dependent structured gradient regularizer to increase the robustness of neural networks vis-a-vis adversarial perturbations. Our regularizer can be derived as a controlled approximation from first principles, leveraging the fundamental link between training with noise and regularization. It adds very little computational overhead during learning and is simple to implement generically in standard deep learning frameworks. Our experiments provide strong evidence that structured gradient regularization can act as an effective first line of defense against attacks based on low-level signal corruption.
- We consider a new stochastic gradient descent algorithm for efficiently solving general min-max optimization problems that arise naturally in distributionally robust learning. By focusing on the entire dataset, current approaches do not scale well. We address this issue by initially focusing on a subset of the data and progressively increasing this support to statistically cover the entire dataset.
- In recent years, the Word2Vec model trained with the Negative Sampling loss function has shown state-of-the-art results in a number of machine learning tasks, including language modeling tasks, such as word analogy and word similarity, and in recommendation tasks, through Prod2Vec, an extension that applies to modeling user shopping activity and user preferences. Several methods that aim to improve upon the standard Negative Sampling loss have been proposed. In our paper we pursue more sophisticated Negative Sampling, by leveraging ideas from the field of Generative Adversarial Networks (GANs), and propose Adversarial Negative Sampling. We build upon the recent progress made in stabilizing the training objective of GANs in the discrete data setting, and introduce a new GAN-Word2Vec model.We evaluate our model on the task of basket completion, and show significant improvements in performance over Word2Vec trained using standard loss functions, including Noise Contrastive Estimation and Negative Sampling.
- Combining Bayesian nonparametrics and a forward model selection strategy, we construct parsimonious Bayesian deep networks (PBDNs) that infer capacity-regularized network architectures from the data and require neither cross-validation nor fine-tuning when training the model. One of the two essential components of a PBDN is the development of a special infinite-wide single-hidden-layer neural network, whose number of active hidden units can be inferred from the data. The other one is the construction of a greedy layer-wise learning algorithm that uses a forward model selection criterion to determine when to stop adding another hidden layer. We develop both Gibbs sampling and stochastic gradient descent based maximum a posteriori inference for PBDNs, providing state-of-the-art classification accuracy and interpretable data subtypes near the decision boundaries, while maintaining low computational complexity for out-of-sample prediction.
- Parameterizing the approximate posterior of a generative model with neural networks has become a common theme in recent machine learning research. While providing appealing flexibility, this approach makes it difficult to impose or assess structural constraints such as conditional independence. We propose a framework for learning representations that relies on Auto-Encoding Variational Bayes and whose search space is constrained via kernel-based measures of independence. In particular, our method employs the $d$-variable Hilbert-Schmidt Independence Criterion (dHSIC) to enforce independence between the latent representations and arbitrary nuisance factors. We show how to apply this method to a range of problems, including the problems of learning invariant representations and the learning of interpretable representations. We also present a full-fledged application to single-cell RNA sequencing (scRNA-seq). In this setting the biological signal in mixed in complex ways with sequencing errors and sampling effects. We show that our method out-performs the state-of-the-art in this domain.
- A regression method for proportional, or fractional, data with mixed effects is outlined, designed for analysis of datasets in which the outcomes have substantial weight at the bounds. In such cases a normal approximation is particularly unsuitable as it can result in incorrect inference. To resolve this problem, we employ a logistic regression model and then apply a bootstrap method to correct conservative confidence intervals. This paper outlines the theory of the method, and demonstrates its utility using simulated data. Working code for the R platform is provided through the package glmmboot, available on CRAN.
- We introduce a Bayesian Gaussian process latent variable model that explicitly captures spatial correlations in data using a parameterized spatial kernel and leveraging structure-exploiting algebra on the model covariance matrices for computational tractability. Inference is made tractable through a collapsed variational bound with similar computational complexity to that of the traditional Bayesian GP-LVM. Inference over partially-observed test cases is achieved by optimizing a "partially-collapsed" bound. Modeling high-dimensional time series systems is enabled through use of a dynamical GP latent variable prior. Examples imputing missing data on images and super-resolution imputation of missing video frames demonstrate the model.
- Conditional generative adversarial networks (cGAN) have led to large improvements in the task of conditional image generation, which lies at the heart of computer vision. The major focus so far has been on performance improvement, while there has been little effort in making cGAN more robust to noise or leveraging structure in the output space of the model. The end-to-end regression (of the generator) might lead to arbitrarily large errors in the output, which is unsuitable for the application of such networks to real-world systems. In this work, we introduce a novel conditional GAN, called RoCGAN, which adds implicit constraints to address the issue. Our proposed model augments the generator with an unsupervised pathway, which encourages the outputs of the generator to span the target manifold even in the presence of large amounts of noise. We prove that RoCGAN shares similar theoretical properties as GAN and experimentally verify that the proposed model outperforms existing state-of-the-art cGAN architectures by a large margin in a variety of domains including images from natural scenes and faces.
- In this paper we propose solving localized multiple kernel learning (LMKL) using LMKL-Net, a feedforward deep neural network. In contrast to previous works, as a learning principle we propose \em parameterizing both the gating function for learning kernel combination weights and the multiclass classifier in LMKL using an attentional network (AN) and a multilayer perceptron (MLP), respectively. In this way we can learn the (nonlinear) decision function in LMKL (approximately) by sequential applications of AN and MLP. Empirically on benchmark datasets we demonstrate that overall LMKL-Net can not only outperform the state-of-the-art MKL solvers in terms of accuracy, but also be trained about \em two orders of magnitude faster with much smaller memory footprint for large-scale learning.
- Approximate Bayesian computation is an established and popular method for likelihood-free inference with applications in many disciplines. The effectiveness of the method depends critically on the availability of well performing summary statistics. Summary statistic selection relies heavily on domain knowledge and carefully engineered features, and can be a laborious time consuming process. Since the method is sensitive to data dimensionality, the process of selecting summary statistics must balance the need to include informative statistics and the dimensionality of the feature vector. This paper proposes to treat the problem of dynamically selecting an appropriate summary statistic from a given pool of candidate summary statistics as a multi-armed bandit problem. This allows approximate Bayesian computation rejection sampling to dynamically focus on a distribution over well performing summary statistics as opposed to a fixed set of statistics. The proposed method is unique in that it does not require any pre-processing and is scalable to a large number of candidate statistics. This enables efficient use of a large library of possible time series summary statistics without prior feature engineering. The proposed approach is compared to state-of-the-art methods for summary statistics selection using a challenging test problem from the systems biology literature.
- We want to compute the integral of a function or the expectation of a random variable with minimal cost and use, for the algorithm and for upper bounds of the complexity, i.i.d. samples. Under certain assumptions it is possible to select a sample size on ground of a variance estimation, or - more generally - on ground of an estimation of a (central absolute) $p$-moment. That way one can guarantee a small absolute error with high probability, the problem is thus called solvable. The expected cost of the method depends on the $p$-moment of the random variable, which can be arbitrarily large. Our lower bounds apply not only to methods based on i.i.d. samples but also to general randomized algorithms. They show that - up to constants - the cost of the algorithm is optimal in terms of accuracy, confidence level, and norm of the particular input random variable. Since the considered classes of random variables or integrands are very large, the worst case cost would be infinite. Nevertheless one can define adaptive stopping rules such that for each input the expected cost is finite. We contrast these positive results with examples of integration problems that are not solvable.
- We study the problem of learning personalized decision policies from observational data while accounting for possible unobserved confounding in the data-generating process. Unlike previous approaches which assume unconfoundedness, i.e., no unobserved confounders affected treatment assignment as well as outcome, we calibrate policy learning for realistic violations of this unverifiable assumption with uncertainty sets motivated by sensitivity analysis in causal inference. Our framework for confounding-robust policy improvement optimizes the minimax regret of a candidate policy against a baseline or reference "status quo" policy, over a uncertainty set around nominal propensity weights. We prove that if the uncertainty set is well-specified, robust policy learning can do no worse than the baseline, and only improve if the data supports it. We characterize the adversarial subproblem and use efficient algorithmic solutions to optimize over parametrized spaces of decision policies such as logistic treatment assignment. We assess our methods on synthetic data and a large clinical trial, demonstrating that confounded selection can hinder policy learning and lead to unwarranted harm, while our robust approach guarantees safety and focuses on well-evidenced improvement.
- Coresets are one of the central methods to facilitate the analysis of large data sets. We continue a recent line of research applying the theory of coresets to logistic regression. First, we show a negative result, namely, that no strongly sublinear sized coresets exist for logistic regression. To deal with intractable worst-case instances we introduce a complexity measure $\mu(X)$, which quantifies the hardness of compressing a data set for logistic regression. $\mu(X)$ has an intuitive statistical interpretation that may be of independent interest. For data sets with bounded $\mu(X)$-complexity, we show that a novel sensitivity sampling scheme produces the first provably sublinear $(1\pm\varepsilon)$-coreset. We illustrate the performance of our method by comparing to uniform sampling as well as to state of the art methods in the area. The experiments are conducted on real world benchmark data for logistic regression.
- We introduce algorithms for online, full-information prediction that are competitive with contextual tree experts of unknown complexity, in both probabilistic and adversarial settings. We show that by incorporating a probabilistic framework of structural risk minimization into existing adaptive algorithms, we can robustly learn not only the presence of stochastic structure when it exists (leading to constant as opposed to $\mathcal{O}(\sqrt{T})$ regret), but also the correct model order. We thus obtain regret bounds that are competitive with the regret of an optimal algorithm that possesses strong side information about both the complexity of the optimal contextual tree expert and whether the process generating the data is stochastic or adversarial. These are the first constructive guarantees on simultaneous adaptivity to the model and the presence of stochasticity.
- Deep neural networks generalize remarkably well without explicit regularization even in the strongly over-parametrized regime. This success suggests that some form of implicit regularization must be at work. By applying a modified version of the coding theorem from algorithmic information theory and by performing extensive empirical analysis of random neural networks, we argue that the parameter function map of deep neural networks is exponentially biased towards functions with lower descriptional complexity. We show explicitly for supervised learning of Boolean functions that the intrinsic simplicity bias of deep neural networks means that they generalize significantly better than an unbiased learning algorithm does. The superior generalization due to simplicity bias can be explained using PAC-Bayes theory, which yields useful generalization error bounds for learning Boolean functions with a wide range of complexities. Finally, we provide evidence that deeper neural networks trained on the CIFAR10 data set exhibit stronger simplicity bias than shallow networks do, which may help explain why deeper networks generalize better than shallow ones do.
- May 23 2018 stat.ME arXiv:1805.08518v1In this work we present a new approach, which we call MISFIT, to fitting functional data models with sparsely and irregularly sampled data. The limitations of current methods have created major challenges in the fitting of more complex nonlinear models. Indeed, currently many models cannot be consistently estimated unless one assumes that the number of observed points per curve grows sufficiently quickly with the sample size. In contrast, we demonstrate that MISFIT, which is based on a multiple imputation framework, has the potential to produce consistent estimates without such an assumption. Just as importantly, it propagates the uncertainty of not having completely observed curves, allowing for a more accurate assessment of the uncertainty of parameter estimates, something that most methods currently cannot accomplish. This work is motivated by a longitudinal study on macrocephaly, or atypically large head size, in which electronic medical records allow for the collection of a great deal of data. However, the sampling is highly variable from child to child. Using the MISFIT approach we are able to clearly demonstrate that the development of pathologic conditions related to macrocephaly is associated with both the overall head circumference of the children as well as the velocity of their head growth.
- By providing a simple and efficient way of computing low-variance gradients of continuous random variables, the reparameterization trick has become the technique of choice for training a variety of latent variable models. However, it is not applicable to a number of important continuous distributions. We introduce an alternative approach to computing reparameterization gradients based on implicit differentiation and demonstrate its broader applicability by applying it to Gamma, Beta, Dirichlet, and von Mises distributions, which cannot be used with the classic reparameterization trick. Our experiments show that the proposed approach is faster and more accurate than the existing gradient estimators for these distributions.
- Generative models for source code are an interesting structured prediction problem, requiring to reason about both hard syntactic and semantic constraints as well as about natural, likely programs. We present a novel model for this problem that uses a graph to represent the intermediate state of the generated output. The generative procedure interleaves grammar-driven expansion steps with graph augmentation and neural message passing steps. An experimental evaluation shows that our new model can generate semantically meaningful expressions, outperforming a range of strong baselines.
- Distributed asynchronous SGD has become widely used for deep learning in large-scale systems, but remains notorious for its instability when increasing the number of workers. In this work, we study the dynamics of distributed asynchronous SGD under the lens of Lagrangian mechanics. Using this description, we introduce the concept of energy to describe the optimization process and derive a sufficient condition ensuring its stability as long as the collective energy induced by the active workers remains below the energy of a target synchronous process. Making use of this criterion, we derive a stable distributed asynchronous optimization procedure, GEM, that estimates and maintains the energy of the asynchronous system below or equal to the energy of sequential SGD with momentum. Experimental results highlight the stability and speedup of GEM compared to existing schemes, even when scaling to one hundred asynchronous workers. Results also indicate better generalization compared to the targeted SGD with momentum.
- Tensor decomposition has been widely applied to find low-rank representations for real-world data and more recently for neural-network parameters too. For the latter, the unfolded matrices may not always be low-rank because the modes of the parameter tensor do not usually have any physical meaning that can be exploited for efficiency. This raises the following question: how can we find low-rank structures when the tensor modes do not have any physical meaning associated with them? For this purpose, we propose a new decomposition method in this paper. Our method uses reshaping and reordering operations that are strictly more general than the unfolding operation. These operations enable us to discover new low-rank structures that are beyond the reach of existing tensor methods. We prove an important theoretical result establishing conditions under which our method results in a unique solution. The experimental results confirm the correctness of our theoretical works and the effectiveness of our methods for weight compression in deep neural networks.
- While a typical supervised learning framework assumes that the inputs and the outputs are measured at the same levels of granularity, many applications, including global mapping of disease, only have access to outputs at a much coarser level than that of the inputs. Aggregation of outputs makes generalization to new inputs much more difficult. We consider an approach to this problem based on variational learning with a model of output aggregation and Gaussian processes, where aggregation leads to intractability of the standard evidence lower bounds. We propose new bounds and tractable approximations, leading to improved prediction accuracy and scalability to large datasets, while explicitly taking uncertainty into account. We develop a framework which extends to several types of likelihoods, including the Poisson model for aggregated count data. We apply our framework to a challenging and important problem, the fine-scale spatial modelling of malaria incidence, with over 1 million observations.
- Meta-learning is a promising method to achieve efficient training method towards deep neural net and has been attracting increases interests in recent years. But most of the current methods are still not capable to train complex neuron net model with long-time training process. In this paper, a novel second-order meta-optimizer, named Meta-learning with Hessian-Free(MLHF) approach, is proposed based on the Hessian Free approach as the framework. Two recurrent neural networks are established to generate the damping and the precondition matrix of this Hessian free framework. A series of techniques to meta-train the MLHF towards stable and reinforce the meta-training of this optimizer, including the gradient calculation of $H$, and use experiment replay on $w^0$. Numerical experiments on deep convolution neural nets, including CUDA-convnet and resnet18(v2), with datasets of cifar10 and ILSVRC2012, indicate that the MLHF shows good and continuous training performance during the whole long-time training process, i.e., both the rapid-decreasing early stage and the steadily-deceasing later stage, and so is a promising meta-learning framework towards elevating the training efficiency in real-world deep neural nets.
- We study the quantification of uncertainty of Convolutional Neural Networks (CNNs) based on gradient metrics. Unlike the classical softmax entropy, such metrics gather information from all layers of the CNN. We show for the (E)MNIST data set that for several such metrics we achieve the same meta classification accuracy -- i.e. the task of classifying correctly predicted labels as correct and incorrectly predicted ones as incorrect without knowing the actual label -- as for entropy thresholding. Meta classification rates for out of sample images can be increased when using entropy together with several gradient based metrics as input quantities for a meta-classifier. This proves that our gradient based metrics do not contain the same information as the entropy. We also apply meta classification to concepts not used during training: EMNIST/Omniglot letters, CIFAR10 and noise. Meta classifiers only trained on the uncertainty metrics of classes available during training usually do not perform equally well for all the unknown concepts letters, CIFAR10 and uniform noise. If we however allow the meta classifier to be trained on uncertainty metrics including some samples of some or all of the categories, meta classification for concepts remote from MNIST digits can be improved considerably.
- May 23 2018 stat.ME arXiv:1805.08423v1Expectation propagation is a general prescription for approximation of integrals in statistical inference problems. Its literature is mainly concerned with Bayesian inference scenarios. However, expectation propagation can also be used to approximate integrals arising in frequentist statistical inference. We focus on likelihood-based inference for binary response mixed models and show that fast and accurate quadrature-free inference can be realized for the probit link case with multivariate random effects and higher levels of nesting. The approach is supported by asymptotic theory in which expectation propagation is seen to provide consistent estimation of the exact likelihood surface. Numerical studies reveal the availability of fast, highly accurate and scalable methodology for binary mixed model analysis.
- The focus in machine learning has branched beyond training classifiers on a single task to investigating how previously acquired knowledge in a source domain can be leveraged to facilitate learning in a related target domain, known as inductive transfer learning. Three active lines of research have independently explored transfer learning using neural networks. In weight transfer, a model trained on the source domain is used as an initialization point for a network to be trained on the target domain. In deep metric learning, the source domain is used to construct an embedding that captures class structure in both the source and target domains. In few-shot learning, the focus is on generalizing well in the target domain based on a limited number of labeled examples. We compare state-of-the-art methods from these three paradigms and also explore hybrid adapted-embedding methods that use limited target-domain data to fine tune embeddings constructed from source-domain data. We conduct a systematic comparison of methods in a variety of domains, varying the number of labeled instances available in the target domain ($k$), as well as the number of target-domain classes. We reach three principle conclusions: (1) Deep embeddings are far superior, compared to weight transfer, as a starting point for inter-domain transfer or model re-use (2) Our hybrid methods robustly outperform every few-shot learning and every deep metric learning method previously proposed, with a mean error reduction of 30% over state-of-the-art. (3) Among loss functions for discovering embeddings, the histogram loss (Ustinova & Lempitsky, 2016) is most robust. We hope our results will motivate a unification of research in weight transfer, deep metric learning, and few-shot learning.
- We study the inverse optimal control problem in social sciences: we aim at learning a user's true cost function from the observed temporal behavior. In contrast to traditional phenomenological works that aim to learn a generative model to fit the behavioral data, we propose a novel variational principle and treat user as a reinforcement learning algorithm, which acts by optimizing his cost function. We first propose a unified KL framework that generalizes existing maximum entropy inverse optimal control methods. We further propose a two-step Wasserstein inverse optimal control framework. In the first step, we compute the optimal measure with a novel mass transport equation. In the second step, we formulate the learning problem as a generative adversarial network. In two real world experiments - recommender systems and social networks, we show that our framework obtains significant performance gains over both existing inverse optimal control methods and point process based generative models.
- The great success of deep learning shows that its technology contains profound truth, and understanding its internal mechanism not only has important implications for the development of its technology and effective application in various fields, but also provides meaningful insights into the understanding of human brain mechanism. At present, most of the theoretical research on deep learning is based on mathematics. This dissertation proposes that the neural network of deep learning is a physical system, examines deep learning from three different perspectives: microscopic, macroscopic, and physical world views, answers multiple theoretical puzzles in deep learning by using physics principles. For example, from the perspective of quantum mechanics and statistical physics, this dissertation presents the calculation methods for convolution calculation, pooling, normalization, and Restricted Boltzmann Machine, as well as the selection of cost functions, explains why deep learning must be deep, what characteristics are learned in deep learning, why Convolutional Neural Networks do not have to be trained layer by layer, and the limitations of deep learning, etc., and proposes the theoretical direction and basis for the further development of deep learning now and in the future. The brilliance of physics flashes in deep learning, we try to establish the deep learning technology based on the scientific theory of physics.
- Despite the remarkable successes of generative adversarial networks (GANs) in many applications, theoretical understandings of their performance is still limited. In this paper, we present a simple shallow GAN model fed by high-dimensional input data. The dynamics of the training process of the proposed model can be exactly analyzed in the high-dimensional limit. In particular, by using the tool of scaling limits of stochastic processes, we show that the macroscopic quantities measuring the quality of the training process converge to a deterministic process that is characterized as the unique solution of a finite-dimensional ordinary differential equation (ODE). The proposed model is simple, but its training process already exhibits several different phases that can mimic the behaviors of more realistic GAN models used in practice. Specifically, depending on the choice of the learning rates, the training process can reach either a successful, a failed, or an oscillating phase. By studying the steady-state solutions of the limiting ODEs, we obtain a phase diagram that precisely characterizes the conditions under which each phase takes place. Although this work focuses on a simple GAN model, the analysis methods developed here might prove useful in the theoretical understanding of other variants of GANs with more advanced training algorithms.
- For neural networks (NNs) with rectified linear unit (ReLU) or binary activation functions, we show that their training can be accomplished in a reduced parameter space. Specifically, the weights in each neuron can be trained on the unit sphere, as opposed to the entire space, and the threshold can be trained in a bounded interval, as opposed to the real line. We show that the NNs in the reduced parameter space are mathematically equivalent to the standard NNs with parameters in the whole space. The reduced parameter space shall facilitate the optimization procedure for the network training, as the search space becomes (much) smaller. We demonstrate the improved training performance using numerical examples.
- In this paper, we propose a novel maximum causal Tsallis entropy (MCTE) framework for imitation learning which can efficiently learn a sparse multi-modal policy distribution from demonstrations. We provide the full mathematical analysis of the proposed framework. First, the optimal solution of an MCTE problem is shown to be a sparsemax distribution, whose supporting set can be adjusted. The proposed method has advantages over a softmax distribution in that it can exclude unnecessary actions by assigning zero probability. Second, we prove that an MCTE problem is equivalent to robust Bayes estimation in the sense of the Brier score. Third, we propose a maximum causal Tsallis entropy imitation learning (MCTEIL) algorithm with a sparse mixture density network (sparse MDN) by modeling mixture weights using a sparsemax distribution. In particular, we show that the causal Tsallis entropy of an MDN encourages exploration and efficient mixture utilization while Boltzmann Gibbs entropy is less effective. We validate the proposed method in two simulation studies and MCTEIL outperforms existing imitation learning methods in terms of average returns and learning multi-modal policies.
- While deep reinforcement learning has successfully solved many challenging control tasks, its real-world applicability has been limited by the inability to ensure the safety of learned policies. We propose an approach to verifiable reinforcement learning by training decision tree policies, which can represent complex policies (since they are nonparametric), yet can be efficiently verified using existing techniques (since they are highly structured). The challenge is that decision tree policies are difficult to train. We propose VIPER, an algorithm that combines ideas from model compression and imitation learning to learn decision tree policies guided by a DNN policy (called the oracle) and its Q-function, and show that it substantially outperforms two baselines. We use VIPER to (i) learn a provably robust decision tree policy for a variant of Atari Pong with a symbolic state space, (ii) learn a decision tree policy for a toy game based on Pong that provably never loses, and (iii) learn a provably stable decision tree policy for cart-pole. In each case, the decision tree policy achieves performance equal to that of the original DNN policy.
- This paper considers the problem of implementing large-scale gradient descent algorithms in a distributed computing setting in the presence of \em straggling processors. To mitigate the effect of the stragglers, it has been previously proposed to encode the data with an erasure-correcting code and decode at the master server at the end of the computation. We, instead, propose to encode the second-moment of the data with a low density parity-check (LDPC) code. The iterative decoding algorithms for LDPC codes have very low computational overhead and the number of decoding iterations can be made to automatically adjust with the number of stragglers in the system. We show that for a random model for stragglers, the proposed moment encoding based gradient descent method can be viewed as the stochastic gradient descent method. This allows us to obtain convergence guarantees for the proposed solution. Furthermore, the proposed moment encoding based method is shown to outperform the existing schemes in a real distributed computing setup.
- May 23 2018 stat.AP arXiv:1805.08323v1The statistical modeling of multivariate count data observed on a space-time lattice has generally focused on using a hierarchical modeling approach where space-time correlation structure is placed on a continuous, unobservable, process. The count distribution is then assumed to be conditionally independent given the latent process. However, in many real-world applications, especially in the modeling of criminal or terrorism data, the conditional independence between the count distributions is inappropriate. In the absence of spatial correlation, the Integer Auto-Regressive Conditionally Heteroskedastic (INGARCH) process could be used to capture this data model dependence however this model does not allow for any unexplained spatial correlation in the data. In this manuscript we propose a class of models that extends the INGARCH process to account for small scale spatial variation, which we refer to as a SPINGARCH process. The resulting model allows both data model dependence as well as dependence in a latent structure. We demonstrate how second-order properties can be used to differentiate between models in this class. Finally, we apply Bayesian inference for the SPINGARCH process demonstrating its use in modeling the spatio-temporal structure of burglaries in Chicago from 2010-2015.
- The celebrated Monte Carlo method estimates a quantity that is expensive to compute by random sampling. We propose adaptive Monte Carlo optimization: a general framework for discrete optimization of an expensive-to-compute function by adaptive random sampling. Applications of this framework have already appeared in machine learning but are tied to their specific contexts and developed in isolation. We take a unified view and show that the framework has broad applicability by applying it on several common machine learning problems: $k$-nearest neighbors, hierarchical clustering and maximum mutual information feature selection. On real data we show that this framework allows us to develop algorithms that confer a gain of a magnitude or two over exact computation. We also characterize the performance gain theoretically under regularity assumptions on the data that we verify in real world data. The code is available at https://github.com/govinda-kamath/combinatorial_MAB.
- In this paper, we propose the Self-Attention Generative Adversarial Network (SAGAN) which allows attention-driven, long-range dependency modeling for image generation tasks. Traditional convolutional GANs generate high-resolution details as a function of only spatially local points in lower-resolution feature maps. In SAGAN, details can be generated using cues from all feature locations. Moreover, the discriminator can check that highly detailed features in distant portions of the image are consistent with each other. Furthermore, recent work has shown that generator conditioning affects GAN performance. Leveraging this insight, we apply spectral normalization to the GAN generator and find that this improves training dynamics. The proposed SAGAN achieves the state-of-the-art results, boosting the best published Inception score from 36.8 to 52.52 and reducing Frechet Inception distance from 27.62 to 18.65 on the challenging ImageNet dataset. Visualization of the attention layers shows that the generator leverages neighborhoods that correspond to object shapes rather than local regions of fixed shape.
- May 23 2018 stat.AP arXiv:1805.08314v1Internet research on search engine quality and validity of results demand much concern. Thus, the focus in our study has been to measure the impact of quotation marks usage on the internet search outputs in terms of google search outcomes distributions, through Benford Law. The current paper is focused on applying a Benford Law analysis on two related types of internet searches distinguished by the usage or absence of quotation marks. Both search results values are assumed as variables. We found that the first digit of outcomes does not follow the Benford Law first digit of numbers in the case of searching text without quotation marks. Unexpectedly, the Benford Law is obeyed when quotation marks are used, even if the variability of search outcomes is considerably reduced. By studying outputs demonstrating influences of (apparently at first) "details", in using a search engine, the authors are able to further warn the users concerning the validity of such outputs.
- We propose a framework for ensuring safe behavior of a reinforcement learning agent when the reward function may be difficult to specify. In order to do this, we rely on the existence of demonstrations from expert policies, and we provide a theoretical framework for the agent to optimize in the space of rewards consistent with its existing knowledge. We propose two methods to solve the resulting optimization: an exact ellipsoid-based method and a method in the spirit of the "follow-the-perturbed-leader" algorithm. Our experiments demonstrate the behavior of our algorithm in both discrete and continuous problems. The trained agent safely avoids states with potential negative effects while imitating the behavior of the expert in the other states.
- The success of deep learning models is heavily tied to the use of massive amount of labeled data and excessively long training time. With the emergence of intelligent edge applications that use these models, the critical challenge is to obtain the same inference capability on a resource-constrained device while providing adaptability to cope with the dynamic changes in the data. We propose AgileNet, a novel lightweight dictionary-based few-shot learning methodology which provides reduced complexity deep neural network for efficient execution at the edge while enabling low-cost updates to capture the dynamics of the new data. Evaluations of state-of-the-art few-shot learning benchmarks demonstrate the superior accuracy of AgileNet compared to prior arts. Additionally, AgileNet is the first few-shot learning approach that prevents model updates by eliminating the knowledge obtained from the primary training. This property is ensured through the dictionaries learned by our novel end-to-end structured decomposition, which also reduces the memory footprint and computation complexity to match the edge device constraints.
- The intrinsic error tolerance of neural network (NN) makes approximate computing a promising technique to improve the energy efficiency of NN inference. Conventional approximate computing focuses on balancing the efficiency-accuracy trade-off for existing pre-trained networks, which can lead to suboptimal solutions. In this paper, we propose AxTrain, a hardware-oriented training framework to facilitate approximate computing for NN inference. Specifically, AxTrain leverages the synergy between two orthogonal methods---one actively searches for a network parameters distribution with high error tolerance, and the other passively learns resilient weights by numerically incorporating the noise distributions of the approximate hardware in the forward pass during the training phase. Experimental results from various datasets with near-threshold computing and approximation multiplication strategies demonstrate AxTrain's ability to obtain resilient neural network parameters and system energy efficiency improvement.
- We introduce geomstats, a python package that performs computations on manifolds such as hyperspheres, hyperbolic spaces, spaces of symmetric positive definite matrices and Lie groups of transformations. We provide efficient and extensively unit-tested implementations of these manifolds, together with useful Riemannian metrics and associated Exponential and Logarithm maps. The corresponding geodesic distances provide a range of intuitive choices of Machine Learning loss functions. We also give the corresponding Riemannian gradients. The operations implemented in geomstats are available with different computing backends such as numpy, tensorflow and keras. We have enabled GPU implementation and integrated geomstats manifold computations into keras deep learning framework. This paper also presents a review of manifolds in machine learning and an overview of the geomstats package with examples demonstrating its use for efficient and user-friendly Riemannian geometry.
- Density estimation is a fundamental problem in statistical learning. This problem is especially challenging for complex high-dimensional data due to the curse of dimensionality. A promising solution to this problem is given here in an inference-free hierarchical framework that is built on score matching. We revisit the Bayesian interpretation of the score function and the Parzen score matching, and construct a multilayer perceptron with a scalable objective for learning the energy (i.e. the unnormalized log-density), which is then optimized with stochastic gradient descent. In addition, the resulting deep energy estimator network (DEEN) is designed as products of experts. We present the utility of DEEN in learning the energy, the score function, and in single-step denoising experiments for synthetic and high-dimensional data. We also diagnose stability problems in the direct estimation of the score function that had been observed for denoising autoencoders.