- This paper focuses on style transfer on the basis of non-parallel text. This is an instance of a broader family of problems including machine translation, decipherment, and sentiment modification. The key technical challenge is to separate the content from desired text characteristics such as sentiment. We leverage refined alignment of latent representations across mono-lingual text corpora with different characteristics. We deliberately modify encoded examples according to their characteristics, requiring the reproduced instances to match available examples with the altered characteristics as a population. We demonstrate the effectiveness of this cross-alignment method on three tasks: sentiment modification, decipherment of word substitution ciphers, and recovery of word order.
- We study causal inference in a multi-environment setting, in which the functional relations for producing the variables from their direct causes remain the same across environments, while the distribution of exogenous noises may vary. We introduce the idea of using the invariance of the functional relations of the variables to their causes across a set of environments. We define a notion of completeness for a causal inference algorithm in this setting and prove the existence of such algorithm by proposing the baseline algorithm. Additionally, we present an alternate algorithm that has significantly improved computational and sample complexity compared to the baseline algorithm. The experiment results show that the proposed algorithm outperforms the other existing algorithms.
- Generative adversarial networks (GANs) can implicitly learn rich distributions over images, audio, and data which are hard to model with an explicit likelihood. We present a practical Bayesian formulation for unsupervised and semi-supervised learning with GANs, in conjunction with stochastic gradient Hamiltonian Monte Carlo to marginalize the weights of the generator and discriminator networks. The resulting approach is straightforward and obtains good performance without any standard interventions such as feature matching, or mini-batch discrimination. By exploring an expressive posterior over the parameters of the generator, the Bayesian GAN avoids mode-collapse, produces interpretable candidate samples with notable variability, and in particular provides state-of-the-art quantitative results for semi-supervised learning on benchmarks including SVHN, CelebA, and CIFAR-10, outperforming DCGAN, Wasserstein GANs, and DCGAN ensembles.
- Deep networks have recently been shown to be vulnerable to universal perturbations: there exist very small image-agnostic perturbations that cause most natural images to be misclassified by such classifiers. In this paper, we propose the first quantitative analysis of the robustness of classifiers to universal perturbations, and draw a formal link between the robustness to universal perturbations, and the geometry of the decision boundary. Specifically, we establish theoretical bounds on the robustness of classifiers under two decision boundary models (flat and curved models). We show in particular that the robustness of deep networks to universal perturbations is driven by a key property of their curvature: there exists shared directions along which the decision boundary of deep networks is systematically positively curved. Under such conditions, we prove the existence of small universal perturbations. Our analysis further provides a novel geometric method for computing universal perturbations, in addition to explaining their properties.
- The goal of this paper is to analyze the geometric properties of deep neural network classifiers in the input space. We specifically study the topology of classification regions created by deep networks, as well as their associated decision boundary. Through a systematic empirical investigation, we show that state-of-the-art deep nets learn connected classification regions, and that the decision boundary in the vicinity of datapoints is flat along most directions. We further draw an essential connection between two seemingly unrelated properties of deep networks: their sensitivity to additive perturbations in the inputs, and the curvature of their decision boundary. The directions where the decision boundary is curved in fact remarkably characterize the directions to which the classifier is the most vulnerable. We finally leverage a fundamental asymmetry in the curvature of the decision boundary of deep nets, and propose a method to discriminate between original images, and images perturbed with small adversarial examples. We show the effectiveness of this purely geometric approach for detecting small adversarial perturbations in images, and for recovering the labels of perturbed images.
- In several physical systems, important properties characterizing the system itself are theoretically related with specific degrees of freedom. Although standard Monte Carlo simulations provide an effective tool to accurately reconstruct the physical configurations of the system, they are unable to isolate the different contributions corresponding to different degrees of freedom. Here we show that unsupervised deep learning can become a valid support to MC simulation, coupling useful insights in the phases detection task with good reconstruction performance. As a testbed we consider the 2D XY model, showing that a deep neural network based on variational autoencoders can detect the continuous Kosterlitz-Thouless (KT) transitions, and that, if endowed with the appropriate constrains, they generate configurations with meaningful physical content.
- May 29 2017 cs.LG arXiv:1705.09518v1Graph-based methods have been quite successful in solving unsupervised and semi-supervised learning problems, as they provide a means to capture the underlying geometry of the dataset. It is often desirable for the constructed graph to satisfy two properties: first, data points that are similar in the feature space should be strongly connected on the graph, and second, the class label information should vary smoothly with respect to the graph, where smoothness is measured using the spectral properties of the graph Laplacian matrix. Recent works have justified some of these smoothness conditions by showing that they are strongly linked to the semi-supervised smoothness assumption and its variants. In this work, we reinforce this connection by viewing the problem from a graph sampling theoretic perspective, where class indicator functions are treated as bandlimited graph signals (in the eigenvector basis of the graph Laplacian) and label prediction as a bandlimited reconstruction problem. Our approach involves analyzing the bandwidth of class indicator signals generated from statistical data models with separable and nonseparable classes. These models are quite general and mimic the nature of most real-world datasets. Our results show that in the asymptotic limit, the bandwidth of any class indicator is also closely related to the geometry of the dataset. This allows one to theoretically justify the assumption of bandlimitedness of class indicator signals, thereby providing a sampling theoretic interpretation of graph-based semi-supervised classification.
- Automatically learning features, especially robust features, has attracted much attention in the machine learning community. In this paper, we propose a new method to learn non-linear robust features by taking advantage of the data manifold structure. We first follow the commonly used trick of the trade, that is learning robust features with artificially corrupted data, which are training samples with manually injected noise. Following the idea of the auto-encoder, we first assume features should contain much information to well reconstruct the input from its corrupted copies. However, merely reconstructing clean input from its noisy copies could make data manifold in the feature space noisy. To address this problem, we propose a new method, called Incremental Auto-Encoders, to iteratively denoise the extracted features. We assume the noisy manifold structure is caused by a diffusion process. Consequently, we reverse this specific diffusion process to further contract this noisy manifold, which results in an incremental optimization of model parameters . Furthermore, we show these learned non-linear features can be stacked into a hierarchy of features. Experimental results on real-world datasets demonstrate the proposed method can achieve better classification performances.
- Online music services are increasing in popularity. They enable us to analyze people's music listening behavior based on play logs. Although it is known that people listen to music based on topic (e.g., rock or jazz), we assume that when a user is addicted to an artist, s/he chooses the artist's songs regardless of topic. Based on this assumption, in this paper, we propose a probabilistic model to analyze people's music listening behavior. Our main contributions are three-fold. First, to the best of our knowledge, this is the first study modeling music listening behavior by taking into account the influence of addiction to artists. Second, by using real-world datasets of play logs, we showed the effectiveness of our proposed model. Third, we carried out qualitative experiments and showed that taking addiction into account enables us to analyze music listening behavior from a new viewpoint in terms of how people listen to music according to the time of day, how an artist's songs are listened to by people, etc. We also discuss the possibility of applying the analysis results to applications such as artist similarity computation and song recommendation.
- Trajectory Prediction of dynamic objects is a widely studied topic in the field of artificial intelligence. Thanks to a large number of applications like predicting abnormal events, navigation system for the blind, etc. there have been many approaches to attempt learning patterns of motion directly from data using a wide variety of techniques ranging from hand-crafted features to sophisticated deep learning models for unsupervised feature learning. All these approaches have been limited by problems like inefficient features in the case of hand crafted features, large error propagation across the predicted trajectory and no information of static artefacts around the dynamic moving objects. We propose an end to end deep learning model to learn the motion patterns of humans using different navigational modes directly from data using the much popular sequence to sequence model coupled with a soft attention mechanism. We also propose a novel approach to model the static artefacts in a scene and using these to predict the dynamic trajectories. The proposed method, tested on trajectories of pedestrians, consistently outperforms previously proposed state of the art approaches on a variety of large scale data sets. We also show how our architecture can be naturally extended to handle multiple modes of movement (say pedestrians, skaters, bikers and buses) simultaneously.
- K-Nearest Neighbours (k-NN) is a popular classification and regression algorithm, yet one of its main limitations is the difficulty in choosing the number of neighbours. We present a Bayesian algorithm to compute the posterior probability distribution for k given a target point within a data-set, efficiently and without the use of Markov Chain Monte Carlo (MCMC) methods or simulation - alongside an exact solution for distributions within the exponential family. The central idea is that data points around our target are generated by the same probability distribution, extending outwards over the appropriate, though unknown, number of neighbours. Once the data is projected onto a distance metric of choice, we can transform the choice of k into a change-point detection problem, for which there is an efficient solution: we recursively compute the probability of the last change-point as we move towards our target, and thus de facto compute the posterior probability distribution over k. Applying this approach to both a classification and a regression UCI data-sets, we compare favourably and, most importantly, by removing the need for simulation, we are able to compute the posterior probability of k exactly and rapidly. As an example, the computational time for the Ripley data-set is a few milliseconds compared to a few hours when using a MCMC approach.
- May 29 2017 cs.LG arXiv:1705.09406v1Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.
- Deep generative models based on Generative Adversarial Networks (GANs) have demonstrated impressive sample quality but in order to work they require a careful choice of architecture, parameter initialization, and selection of hyper-parameters. This fragility is in part due to a dimensional mismatch between the model distribution and the true distribution, causing their density ratio and the associated f-divergence to be undefined. We overcome this fundamental limitation and propose a new regularization approach with low computational cost that yields a stable GAN training procedure. We demonstrate the effectiveness of this approach on several datasets including common benchmark image generation tasks. Our approach turns GAN models into reliable building blocks for deep learning.
- We define a second-order neural network stochastic gradient training algorithm whose block-diagonal structure effectively amounts to normalizing the unit activations. Investigating why this algorithm lacks in robustness then reveals two interesting insights. The first insight suggests a new way to scale the stepsizes, clarifying popular algorithms such as RMSProp as well as old neural network tricks such as fanin stepsize scaling. The second insight stresses the practical importance of dealing with fast changes of the curvature of the cost.
- It can be difficult to tell whether a trained generative model has learned to generate novel examples or has simply memorized a specific set of outputs. In published work, it is common to attempt to address this visually, for example by displaying a generated example and its nearest neighbor(s) in the training set (in, for example, the L2 metric). As any generative model induces a probability density on its output domain, we propose studying this density directly. We first study the geometry of the latent representation and generator, relate this to the output density, and then develop techniques to compute and inspect the output density. As an application, we demonstrate that "memorization" tends to a density made of delta functions concentrated on the memorized examples. We note that without first understanding the geometry, the measurement would be essentially impossible to make.
- A Discriminative Deep Forest (DisDF) as a metric learning algorithm is proposed in the paper. It is based on the Deep Forest or gcForest proposed by Zhou and Feng and can be viewed as a gcForest modification. The case of the fully supervised learning is studied when the class labels of individual training examples are known. The main idea underlying the algorithm is to assign weights to decision trees in random forest in order to reduce distances between objects from the same class and to increase them between objects from different classes. The weights are training parameters. A specific objective function which combines Euclidean and Manhattan distances and simplifies the optimization problem for training the DisDF is proposed. The numerical experiments illustrate the proposed distance metric algorithm.
- This paper focuses on detecting anomalies in a digital video broadcasting (DVB) system from providers' perspective. We learn a probabilistic deterministic real timed automaton profiling benign behavior of encryption control in the DVB control access system. This profile is used as a one-class classifier. Anomalous items in a testing sequence are detected when the sequence is not accepted by the learned model.
- May 29 2017 cs.LG arXiv:1705.09322v1