# Machine Learning (stat.ML)

• Many research fields codify their findings in standard formats, often by reporting correlations between quantities of interest. But the space of all testable correlates is far larger than scientific resources can currently address, so the ability to accurately predict correlations would be useful to plan research and allocate resources. Using a dataset of approximately 170,000 correlational findings extracted from leading social science journals, we show that a trained neural network can accurately predict the reported correlations using only the text descriptions of the correlates. Accurate predictive models such as these can guide scientists towards promising untested correlates, better quantify the information gained from new findings, and has implications for moving artificial intelligence systems from predicting structures to predicting relationships in the real world.
• The surfaces of many cultural heritage objects were embellished with various patterns, especially curve patterns. In practice, most of the unearthed cultural heritage objects are highly fragmented, e.g., sherds of potteries or vessels, and each of them only shows a very small portion of the underlying full design, with noise and deformations. The goal of this paper is to address the challenging problem of automatically identifying the underlying full design of curve patterns from such a sherd. Specifically, we formulate this problem as template matching: curve structure segmented from the sherd is matched to each location with each possible orientation of each known full design. In this paper, we propose a new two-stage matching algorithm, with a different matching cost in each stage. In Stage 1, we use a traditional template matching, which is highly computationally efficient, over the whole search space and identify a small set of candidate matchings. In Stage 2, we derive a new matching cost by training a dual-source Convolutional Neural Network (CNN) and apply it to re-rank the candidate matchings identified in Stage 1. We collect 600 pottery sherds with 98 full designs from the Woodland Period in Southeastern North America for experiments and the performance of the proposed algorithm is very competitive.
• Explicit encoding of group actions in deep features makes it possible for convolutional neural networks (CNNs) to handle global deformations of images, which is critical to success in many vision tasks. This paper proposes to decompose the convolutional filters over joint steerable bases across the space and the group geometry simultaneously, namely a rotation-equivariant CNN with decomposed convolutional filters (RotDCF). This decomposition facilitates computing the joint convolution, which is proved to be necessary for the group equivariance. It significantly reduces the model size and computational complexity while preserving performance, and truncation of the bases expansion serves implicitly to regularize the filters. On datasets involving in-plane and out-of-plane object rotations, RotDCF deep features demonstrate greater robustness and interpretability than regular CNNs. The stability of the equivariant representation to input variations is also proved theoretically under generic assumptions on the filters in the decomposed form. The RotDCF framework can be extended to groups other than rotations, providing a general approach which achieves both group equivariance and representation stability at a reduced model size.
• This paper studies the relationship between the classification performed by deep neural networks and the $k$-NN decision at the embedding space of these networks. This simple important connection shown here provides a better understanding of the relationship between the ability of neural networks to generalize and their tendency to memorize the training data, which are traditionally considered to be contradicting to each other and here shown to be compatible and complementary. Our results support the conjecture that deep neural networks approach Bayes optimal error rates.
• In this paper we explore acceleration techniques for large scale nonconvex optimization problems with special focuses on deep neural networks. The extrapolation scheme is a classical approach for accelerating stochastic gradient descent for convex optimization, but it does not work well for nonconvex optimization typically. Alternatively, we propose an interpolation scheme to accelerate nonconvex optimization and call the method Interpolatron. We explain motivation behind Interpolatron and conduct a thorough empirical analysis. Empirical results on DNNs of great depths (e.g., 98-layer ResNet and 200-layer ResNet) on CIFAR-10 and ImageNet show that Interpolatron can converge much faster than the state-of-the-art methods such as the SGD with momentum and Adam. Furthermore, Anderson's acceleration, in which mixing coefficients are computed by least-squares estimation, can also be used to improve the performance. Both Interpolatron and Anderson's acceleration are easy to implement and tune. We also show that Interpolatron has linear convergence rate under certain regularity assumptions.
• Standard myopic active learning assumes that human annotations are always obtainable whenever new samples are selected. This, however, is unrealistic in many real-world applications where human experts are not readily available at all times. In this paper, we consider the single shot setting: all the required samples should be chosen in a single shot and no human annotation can be exploited during the selection process. We propose a new method, Active Learning through Random Labeling (ALRL), which substitutes single human annotator for multiple, what we will refer to as, pseudo annotators. These pseudo annotators always provide uniform and random labels whenever new unlabeled samples are queried. This random labeling enables standard active learning algorithms to also exhibit the exploratory behavior needed for single shot active learning. The exploratory behavior is further enhanced by selecting the most representative sample via minimizing nearest neighbor distance between unlabeled samples and queried samples. Experiments on real-world datasets demonstrate that the proposed method outperforms several state-of-the-art approaches.
• The project intends to model the stability of power system with a deep learning algorithm to the problem, aiming to delay the removal of the fault. The so-called "fail-delay cut-off" refers to the occurrence of N-1 backup protection action on the backbone network of the system, resulting in longer time for the removal of the fault. In practice, through the analysis and calculation of a large number of online data, we have found that the N-1 failure system of the main protection action will not be unstable, which is also a guarantee of the operation mode arrangement. In the case of the N-1 backup protection action, there is an approximately 2.5% probability that the system will be destabilized. Therefore, research is needed to improve the operating arrangement.
• We apply both distance-based (Jin and Matteson, 2017) and kernel-based (Pfister et al., 2016) mutual dependence measures to independent component analysis (ICA), and generalize dCovICA (Matteson and Tsay, 2017) to MDMICA, minimizing empirical dependence measures as an objective function in both deflation and parallel manners. Solving this minimization problem, we introduce Latin hypercube sampling (LHS) (McKay et al., 2000), and a global optimization method, Bayesian optimization (BO) (Mockus, 1994) to improve the initialization of the Newton-type local optimization method. The performance of MDMICA is evaluated in various simulation studies and an image data example. When the ICA model is correct, MDMICA achieves competitive results compared to existing approaches. When the ICA model is misspecified, the estimated independent components are less mutually dependent than the observed components using MDMICA, while they are prone to be even more mutually dependent than the observed components using other approaches.
• Embedding methods which enforce a partial order or lattice structure over the concept space, such as Order Embeddings (OE) (Vendrov et al., 2016), are a natural way to model transitive relational data (e.g. entailment graphs). However, OE learns a deterministic knowledge base, limiting expressiveness of queries and the ability to use uncertainty for both prediction and learning (e.g. learning from expectations). Probabilistic extensions of OE (Lai and Hockenmaier, 2017) have provided the ability to somewhat calibrate these denotational probabilities while retaining the consistency and inductive bias of ordered models, but lack the ability to model the negative correlations found in real-world knowledge. In this work we show that a broad class of models that assign probability measures to OE can never capture negative correlation, which motivates our construction of a novel box lattice and accompanying probability measure to capture anticorrelation and even disjoint concepts, while still providing the benefits of probabilistic modeling, such as the ability to perform rich joint and conditional queries over arbitrary sets of concepts, and both learning from and predicting calibrated uncertainty. We show improvements over previous approaches in modeling the Flickr and WordNet entailment graphs, and investigate the power of the model.
• Generative Adversarial Nets (GANs) and Variational Auto-Encoders (VAEs) provide impressive image generations from Gaussian white noise, but the underlying mathematics are not well understood. We compute deep convolutional network generators by inverting a fixed embedding operator. Therefore, they do not require to be optimized with a discriminator or an encoder. The embedding is Lipschitz continuous to deformations so that generators transform linear interpolations between input white noise vectors into deformations between output images. This embedding is computed with a wavelet Scattering transform. Numerical experiments demonstrate that the resulting Scattering generators have similar properties as GANs or VAEs, without learning a discriminative network or an encoder.
• In recent years, deep neural network approaches have been widely adopted for machine learning tasks, including classification. However, they were shown to be vulnerable to adversarial perturbations: carefully crafted small perturbations can cause misclassification of legitimate images. We propose Defense-GAN, a new framework leveraging the expressive capability of generative models to defend deep neural networks against such attacks. Defense-GAN is trained to model the distribution of unperturbed images. At inference time, it finds a close output to a given image which does not contain the adversarial changes. This output is then fed to the classifier. Our proposed method can be used with any classification model and does not modify the classifier structure or training procedure. It can also be used as a defense against any attack as it does not assume knowledge of the process for generating the adversarial examples. We empirically show that Defense-GAN is consistently effective against different attack methods and improves on existing defense strategies. Our code has been made publicly available at https://github.com/kabkabm/defensegan.
• In this paper, we propose a general framework to accelerate significantly the algorithms for nonnegative matrix factorization (NMF). This framework is inspired from the extrapolation scheme used to accelerate gradient methods in convex optimization and from the method of parallel tangents. However, the use of extrapolation in the context of the two-block coordinate descent algorithms tackling the non-convex NMF problems is novel. We illustrate the performance of this approach on two state-of-the-art NMF algorithms, namely, accelerated hierarchical alternating least squares (A-HALS) and alternating nonnegative least squares (ANLS), using synthetic, image and document data sets.
• We build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators. Our key result is that a large class of DNs can be written as a composition of max-affine spline operators (MASOs), which provide a powerful portal through which to view and analyze their inner workings. For instance, conditioned on the input signal, the output of a MASO DN can be written as a simple affine transformation of the input. This implies that a DN constructs a set of signal-dependent, class-specific templates against which the signal is compared via a simple inner product; we explore the links to the classical theory of optimal classification via matched filters and the effects of data memorization. Going further, we propose a simple penalty term that can be added to the cost function of any DN learning algorithm to force the templates to be orthogonal with each other; this leads to significantly improved classifi- cation performance and reduced overfitting with no change to the DN architecture. The spline partition of the input signal space that is implicitly induced by a MASO directly links DNs to the theory of vector quantization (VQ) and K-means clustering, which opens up new geometric avenue to study how DNs organize signals in a hierarchical fashion. To validate the utility of the VQ interpretation, we develop and validate a new distance metric for signals and images that quantifies the difference between their VQ encodings. (This paper is a significantly expanded version of a paper with the same title that will appear at ICML 2018.)
• Sleep staging plays an important role in assessment and treatment of sleep issues. This work proposes a joint classification-and-prediction framework based on convolutional neural networks (CNNs) for automatic sleep staging, and, subsequently, introduces a simple yet efficient CNN architecture to power the framework. Given a single input epoch, the framework jointly determines its label (classification) and those of its neighboring epochs in the contextual output (prediction). While the proposed framework is orthogonal to the widely adopted classification schemes, which takes one or multiple epochs as a contextual input and produces a single classification decision on the target epoch, we demonstrate its advantages in different ways. First, it leverages the dependency among consecutive sleep epochs while surpassing the problems experienced with the common classification schemes. Second, even with a single model, the framework effortlessly produces multiple decisions as in ensemble-of-models methods which are essential in obtaining a good performance. Probabilistic aggregation techniques are then proposed to leverage the availability of multiple decisions. We demonstrate good performance on the Montreal Archive of Sleep Studies (MASS) dataset consisting 200 subjects with an average accuracy of 83.6%. We also show that the proposed framework not only is superior to the baselines based on the common classification schemes but also outperforms existing deep-learning approaches. To our knowledge, this is the first work going beyond the standard single-output classification to consider neural networks with multiple outputs for automatic sleep staging. This framework could provide avenues for further studies of different neural-network architectures for this problem.
• Structural causal models are a popular tool to describe causal relations in systems in many fields such as economy, the social sciences, and biology. In this work, we show that these models are not flexible enough in general to give a complete causal representation of equilibrium states in dynamical systems that do not have a unique stable equilibrium independent of initial conditions. We prove that our proposed generalized structural causal models do capture the essential causal semantics that characterize these systems. We illustrate the power and flexibility of this extension on a dynamical system corresponding to a basic enzymatic reaction. We motivate our approach further by showing that it also efficiently describes the effects of interventions on functional laws such as the ideal gas law.
• The Gaussian mechanism is an essential building block used in multitude of differentially private data analysis algorithms. In this paper we revisit the Gaussian mechanism and show that the original analysis has several important limitations. Our analysis reveals that the variance formula for the original mechanism is far from tight in the high privacy regime ($\varepsilon \to 0$) and it cannot be extended to the low privacy regime ($\varepsilon \to \infty$). We address this limitations by developing an analytic Gaussian mechanism whose variance is calibrated directly using the Gaussian cumulative density function instead of a tail bound approximation. We also propose to equip the Gaussian mechanism with a post-processing step based on adaptive denoising estimators by leveraging that the variance of the perturbation is known. Our experiments show that analytical calibration removes at least a third of the variance of the noise compared to the classical Gaussian mechanism, and that denoising dramatically improves the accuracy of the Gaussian mechanism in the high-dimensional regime.
• In this paper we study the problem of learning the weights of a deep convolutional neural network. We consider a network where convolutions are carried out over non-overlapping patches with a single kernel in each layer. We develop an algorithm for simultaneously learning all the kernels from the training data. Our approach dubbed Deep Tensor Decomposition (DeepTD) is based on a rank-1 tensor decomposition. We theoretically investigate DeepTD under a realizable model for the training data where the inputs are chosen i.i.d. from a Gaussian distribution and the labels are generated according to planted convolutional kernels. We show that DeepTD is data-efficient and provably works as soon as the sample size exceeds the total number of convolutional weights in the network. We carry out a variety of numerical experiments to investigate the effectiveness of DeepTD and verify our theoretical findings.
• We propose a new method of estimation in topic models, that is not a variation on the existing simplex finding algorithms, and that estimates the number of topics K from the observed data. We derive new finite sample minimax lower bounds for the estimation of A, as well as new upper bounds for our proposed estimator. We describe the scenarios where our estimator is minimax adaptive. Our finite sample analysis is valid for any number of documents (n), individual document length (N_i), dictionary size (p) and number of topics (K), and both p and K are allowed to increase with n, a situation not handled well by previous analyses. We complement our theoretical results with a detailed simulation study. We illustrate that the new algorithm is faster and more accurate than the current ones, although we start out with a computational and theoretical disadvantage of not knowing the correct number of topics K, while we provide the competing methods with the correct value in our simulations.
• We present a high-dimensional analysis of three popular algorithms, namely, Oja's method, GROUSE and PETRELS, for subspace estimation from streaming and highly incomplete observations. We show that, with proper time scaling, the time-varying principal angles between the true subspace and its estimates given by the algorithms converge weakly to deterministic processes when the ambient dimension $n$ tends to infinity. Moreover, the limiting processes can be exactly characterized as the unique solutions of certain ordinary differential equations (ODEs). A finite sample bound is also given, showing that the rate of convergence towards such limits is $\mathcal{O}(1/\sqrt{n})$. In addition to providing asymptotically exact predictions of the dynamic performance of the algorithms, our high-dimensional analysis yields several insights, including an asymptotic equivalence between Oja's method and GROUSE, and a precise scaling relationship linking the amount of missing data to the signal-to-noise ratio. By analyzing the solutions of the limiting ODEs, we also establish phase transition phenomena associated with the steady-state performance of these techniques.
• May 18 2018 stat.ML cs.LG stat.ME arXiv:1805.06826v1
Causal inference from observation data often assumes "strong ignorability," that all confounders are observed. This assumption is standard yet untestable. However, many scientific studies involve multiple causes, different variables whose effects are simultaneously of interest. We propose the deconfounder, an algorithm that combines unsupervised machine learning and predictive model checking to perform causal inference in multiple-cause settings. The deconfounder infers a latent variable as a substitute for unobserved confounders and then uses that substitute to perform causal inference. We develop theory for when the deconfounder leads to unbiased causal estimates, and show that it requires weaker assumptions than classical causal inference. We analyze its performance in three types of studies: semi-simulated data around smoking and lung cancer, semi-simulated data around genomewide association studies, and a real dataset about actors and movie revenue. The deconfounder provides a checkable approach to estimating close-to-truth causal effects.
• May 18 2018 cs.LG stat.ML arXiv:1805.06792v1
We consider the use of no-regret algorithms to compute equilibria for particular classes of convex-concave games. While standard regret bounds would lead to convergence rates on the order of $O(T^{-1/2})$, recent work \citepRS13,SALS15 has established $O(1/T)$ rates by taking advantage of a particular class of optimistic prediction algorithms. In this work we go further, showing that for a particular class of games one achieves a $O(1/T^2)$ rate, and we show how this applies to the Frank-Wolfe method and recovers a similar bound \citepD15. We also show that such no-regret techniques can even achieve a linear rate, $O(\exp(-T))$, for equilibrium computation under additional curvature assumptions.
• We conduct an extensive empirical study on short-term electricity price forecasting (EPF) to address the long-standing question if the optimal model structure for EPF is univariate or multivariate. We provide evidence that despite a minor edge in predictive performance overall, the multivariate modeling framework does not uniformly outperform the univariate one across all 12 considered datasets, seasons of the year or hours of the day, and at times is outperformed by the latter. This is an indication that combining advanced structures or the corresponding forecasts from both modeling approaches can bring a further improvement in forecasting accuracy. We show that this indeed can be the case, even for a simple averaging scheme involving only two models. Finally, we also analyze variable selection for the best performing high-dimensional lasso-type models, thus provide guidelines to structuring better performing forecasting model designs.
• As a crucial problem in statistics is to decide whether additional variables are needed in a regression model. We propose a new multivariate test to investigate the conditional mean independence of Y given X conditioning on some known effect Z, i.e., E(Y|X, Z) = E(Y|Z). Assuming that E(Y|Z) and Z are linearly related, we reformulate an equivalent notion of conditional mean independence through transformation, which is approximated in practice. We apply the martingale difference divergence (Shao and Zhang, 2014) to measure conditional mean dependence, and show that the estimation error from approximation is negligible, as it has no impact on the asymptotic distribution of the test statistic under some regularity assumptions. The implementation of our test is demonstrated by both simulations and a financial data example.
• A key problem in location-based modeling and forecasting lies in identifying suitable spatial and temporal resolutions. In particular, judicious spatial partitioning can play a significant role in enhancing the performance of location-based forecasting models. In this work, we investigate two widely used tessellation strategies for partitioning city space, in the context of real-time taxi demand forecasting. Our study compares (i) Geohash tessellation, and (ii) Voronoi tessellation, using two distinct taxi demand datasets, over multiple time scales. For the purpose of comparison, we employ classical time-series tools to model the spatio-temporal demand. Our study finds that the performance of each tessellation strategy is highly dependent on the city geography, spatial distribution of the data, and the time of the day, and that neither strategy is found to perform optimally across the forecast horizon. We propose a hybrid tessellation algorithm that picks the best tessellation strategy at each instant, based on their performance in the recent past. Our hybrid algorithm is a non-stationary variant of the well-known HEDGE algorithm for choosing the best advice from multiple experts. We show that the hybrid tessellation strategy performs consistently better than either of the two strategies across the data sets considered, at multiple time scales, and with different performance metrics. We achieve an average accuracy of above 80% per km^2 for both data sets considered at 60 minute aggregation levels.
• May 18 2018 stat.ML cs.LG arXiv:1805.06595v1
Modern bio-technologies have produced a vast amount of high-throughput data with the number of predictors far greater than the sample size. In order to identify more novel biomarkers and understand biological mechanisms, it is vital to detect signals weakly associated with outcomes among ultrahigh-dimensional predictors. However, existing screening methods, which typically ignore correlation information, are likely to miss these weak signals. By incorporating the inter-feature dependence, we propose a covariance-insured screening methodology to identify predictors that are jointly informative but only marginally weakly associated with outcomes. The validity of the method is examined via extensive simulations and real data studies for selecting potential genetic factors related to the onset of cancer.