results for au:Wang_L in:stat

- Recurrent Neural Networks (RNNs) are powerful sequence modeling tools. However, when dealing with high dimensional inputs, the training of RNNs becomes computational expensive due to the large number of model parameters. This hinders RNNs from solving many important computer vision tasks, such as Action Recognition in Videos and Image Captioning. To overcome this problem, we propose a compact and flexible structure, namely Block-Term tensor decomposition, which greatly reduces the parameters of RNNs and improves their training efficiency. Compared with alternative low-rank approximations, such as tensor-train RNN (TT-RNN), our method, Block-Term RNN (BT-RNN), is not only more concise (when using the same rank), but also able to attain a better approximation to the original RNNs with much fewer parameters. On three challenging tasks, including Action Recognition in Videos, Image Captioning and Image Generation, BT-RNN outperforms TT-RNN and the standard RNN in terms of both prediction accuracy and convergence rate. Specifically, BT-LSTM utilizes 17,388 times fewer parameters than the standard LSTM to achieve an accuracy improvement over 15.6\% in the Action Recognition task on the UCF11 dataset.
- We compare and contrast the statistical physics and quantum physics inspired approaches for unsupervised generative modeling of classical data. The two approaches represent probabilities of observed data using energy-based models and quantum states respectively.Classical and quantum information patterns of the target datasets therefore provide principled guidelines for structural design and learning in these two approaches. Taking the restricted Boltzmann machines (RBM) as an example, we analyze the information theoretical bounds of the two approaches. We verify our reasonings by comparing the performance of RBMs of various architectures on the standard MNIST datasets.
- Wisdom of the crowd, the collective intelligence derived from responses of multiple human or machine individuals to the same questions, can be more accurate than each individual, and improve social decision-making and prediction accuracy. This can also integrate multiple programs or datasets, each as an individual, for the same predictive questions. Crowd wisdom estimates each individual's independent error level arising from their limited knowledge, and finds the crowd consensus that minimizes the overall error. However, previous studies have merely built isolated, problem-specific models with limited generalizability, and mainly for binary (yes/no) responses. Here we show with simulation and real-world data that the crowd wisdom problem is analogous to one-dimensional unsupervised dimension reduction in machine learning. This provides a natural class of crowd wisdom solutions, such as principal component analysis and Isomap, which can handle binary and also continuous responses, like confidence levels, and consequently can be more accurate than existing solutions. They can even outperform supervised-learning-based collective intelligence that is calibrated on historical performance of individuals, e.g. penalized linear regression and random forest. This study unifies crowd wisdom and unsupervised dimension reduction, and thereupon introduces a broad range of highly-performing and widely-applicable crowd wisdom methods. As the costs for data acquisition and processing rapidly decrease, this study will promote and guide crowd wisdom applications in the social and natural sciences, including data fusion, meta-analysis, crowd-sourcing, and committee decision making.
- We consider how to quantify the causal effect from a random variable to a response variable. We show that with multiple Markov boundaries, conditional mutual information (CMI) will produce 0, while causal strength (CS) and part mutual information (PMI), which claim to behave better, are not well-defined, and have other problems. The reason is that the quantitative causal inference with multiple Markov boundaries is an ill-posed problem. We will give a criterion and some applicable algorithms to determine whether a distribution has non-unique Markov boundaries.
- Oct 17 2017 stat.ME arXiv:1710.05283v1\citebickel2009nonparametric developed a general framework to establish consistency of community detection in stochastic block model (SBM). In most applications of this framework, the community label is discrete. For example, in \citepbickel2009nonparametric,zhao2012consistency the degree corrected SBM is assumed to have a discrete degree parameter. In this paper, we generalize the method of \citebickel2009nonparametric to give consistency analysis of maximum likelihood estimator (MLE) in SBM with continuous community label. We show that there is a standard procedure to transform the $||\cdot||_2$ error bound to the uniform error bound. We demonstrate the application of our general results by proving the uniform consistency (strong consistency) of the MLE in the exponential network model with interaction effect. Unfortunately, in the continuous parameter case, the condition ensuring uniform consistency we obtained is much stronger than that in the discrete parameter case, namely $n\mu_n^5/(\log n)^{8}\rightarrow\infty$ versus $n\mu_n/\log n\rightarrow\infty$. Where $n\mu_n$ represents the average degree of the network. But continuous is the limit of discrete. So it is not surprising as we show that by discretizing the community label space into sufficiently small (but not too small) pieces and applying the MLE on the discretized community label space, uniform consistency holds under almost the same condition as in discrete community label space. Such a phenomenon is surprising since the discretization does not depend on the data or the model. This reminds us of the thresholding method.
- With the maturation of metabolomics science and proliferation of biobanks, clinical metabolic profiling is an increasingly opportunistic frontier for advancing translational clinical research. Automated Machine Learning (AutoML) approaches provide exciting opportunity to guide feature selection in agnostic metabolic profiling endeavors, where potentially thousands of independent data points must be evaluated. In previous research, AutoML using high-dimensional data of varying types has been demonstrably robust, outperforming traditional approaches. However, considerations for application in clinical metabolic profiling remain to be evaluated. Particularly, regarding the robustness of AutoML to identify and adjust for common clinical confounders. In this study, we present a focused case study regarding AutoML considerations for using the Tree-Based Optimization Tool (TPOT) in metabolic profiling of exposure to metformin in a biobank cohort. First, we propose a tandem rank-accuracy measure to guide agnostic feature selection and corresponding threshold determination in clinical metabolic profiling endeavors. Second, while AutoML, using default parameters, demonstrated potential to lack sensitivity to low-effect confounding clinical covariates, we demonstrated residual training and adjustment of metabolite features as an easily applicable approach to ensure AutoML adjustment for potential confounding characteristics. Finally, we present increased homocysteine with long-term exposure to metformin as a potentially novel, non-replicated metabolite association suggested by TPOT; an association not identified in parallel clinical metabolic profiling endeavors. While considerations are recommended, including adjustment approaches for clinical confounders, AutoML presents an exciting tool to enhance clinical metabolic profiling and advance translational research endeavors.
- Sep 26 2017 stat.ME arXiv:1709.08281v1Structural nested mean models (SNMMs) are among the fundamental tools for inferring causal effects of time-dependent exposures from longitudinal studies. With binary outcomes, however, current methods for estimating multiplicative and additive SNMM parameters suffer from variation dependence between the causal SNMM parameters and the non-causal nuisance parameters. Estimating methods for logistic SNMMs do not suffer from this dependence. Unfortunately, in contrast with the multiplicative and additive models, unbiased estimation of the causal parameters of a logistic SNMM rely on additional modeling assumptions even when the treatment probabilities are known. These difficulties have hindered the uptake of SNMMs in epidemiological practice, where binary outcomes are common. We solve the variation dependence problem for the binary multiplicative SNMM by a reparametrization of the non-causal nuisance parameters. Our novel nuisance parameters are variation independent of the causal parameters, and hence allows the fitting of a multiplicative SNMM by unconstrained maximum likelihood. It also allows one to construct true (i.e. congenial) doubly robust estimators of the causal parameters. Along the way, we prove that an additive SNMM with binary outcomes does not admit a variation independent parametrization, thus explaining why we restrict ourselves to the multiplicative SNMM.
- Generative modeling, which learns joint probability distribution from training data and generates samples according to it, is an important task in machine learning and artificial intelligence. Inspired by probabilistic interpretation of quantum physics, we propose a generative model using matrix product states, which is a tensor network originally proposed for describing (particularly one-dimensional) entangled quantum states. Our model enjoys efficient learning by utilizing the density matrix renormalization group method which allows dynamic adjusting dimensions of the tensors, and offers an efficient direct sampling approach, Zipper, for generative tasks. We apply our method to generative modeling of several standard datasets including the principled Bars and Stripes, random binary patterns and the MNIST handwritten digits, to illustrate ability of our model, and discuss features as well as drawbacks of our model over popular generative models such as Hopfield model, Boltzmann machines and generative adversarial networks. Our work shed light on many interesting directions for future exploration on the development of quantum-inspired algorithms for unsupervised machine learning, which is of possibility of being realized by a quantum device.
- t-Distributed Stochastic Neighbor Embedding (t-SNE) is one of the most widely used dimensionality reduction methods for data visualization, but it has a perplexity hyperparameter that requires manual selection. In practice, proper tuning of t-SNE perplexity requires users to understand the inner working of the method as well as to have hands-on experience. We propose a model selection objective for t-SNE perplexity that requires negligible extra computation beyond that of the t-SNE itself. We empirically validate that the perplexity settings found by our approach are consistent with preferences elicited from human experts across a number of datasets. The similarities of our approach to Bayesian information criteria (BIC) and minimum description length (MDL) are also analyzed.
- Jul 27 2017 stat.ME arXiv:1707.08215v2We consider the problem of calibrating inexact computer models using experimental data. To compensate for the misspecification of the computer model, a discrepancy function is usually included and modeled via a Gaussian stochastic process (GaSP), leading to better results of prediction. The calibrated computer model itself, however, sometimes fits the experimental data poorly, as the calibration parameters become unidentifiable. In this work, we propose the scaled Gaussian stochastic process (S-GaSP), a novel stochastic process for calibration and prediction. This new approach bridges the gap between two predominant methods, namely the $L_2$ calibration and GaSP calibration. A computationally feasible approach is introduced for this new model under the Bayesian paradigm. Compared with the GaSP calibration, the calibrated parameters are more interpretable in the S-GaSP calibration, which enables the calibrated computer model itself to predict well on the reality. Simulated and real examples are provided to illustrate the connections and differences between this new model and other previous approaches.
- Jul 21 2017 stat.CO arXiv:1707.06360v1This article focuses on the problem of studying shared- and individual-specific structure in replicated networks or graph-valued data. In particular, the observed data consist of $n$ graphs, $G_i, i=1,\ldots,n$, with each graph consisting of a collection of edges between $V$ nodes. In brain connectomics, the graph for an individual corresponds to a set of interconnections among brain regions. Such data can be organized as a $V \times V$ binary adjacency matrix $A_i$ for each $i$, with ones indicating an edge between a pair of nodes and zeros indicating no edge. When nodes have a shared meaning across replicates $i=1,\ldots,n$, it becomes of substantial interest to study similarities and differences in the adjacency matrices. To address this problem, we propose a method to estimate a common structure and low-dimensional individual-specific deviations from replicated networks. The proposed Multiple GRAph Factorization (M-GRAF) model relies on a logistic regression mapping combined with a hierarchical singular value decomposition. We develop an efficient algorithm for maximum likelihood estimation, and study basic properties of our approach. Simulation studies show excellent operating characteristics and we apply the method to human brain connectomics data.
- Algorithm-dependent generalization error bounds are central to statistical learning theory. A learning algorithm may use a large hypothesis space, but the limited number of iterations controls its model capacity and generalization error. The impacts of stochastic gradient methods on generalization error for non-convex learning problems not only have important theoretical consequences, but are also critical to generalization errors of deep learning. In this paper, we study the generalization errors of Stochastic Gradient Langevin Dynamics (SGLD) with non-convex objectives. Two theories are proposed with non-asymptotic discrete-time analysis, using Stability and PAC-Bayesian results respectively. The stability-based theory obtains a bound of $O\left(\frac{1}{n}L\sqrt{\beta T_k}\right)$, where $L$ is uniform Lipschitz parameter, $\beta$ is inverse temperature, and $T_k$ is aggregated step sizes. For PAC-Bayesian theory, though the bound has a slower $O(1/\sqrt{n})$ rate, the contribution of each step is shown with an exponentially decaying factor by imposing $\ell^2$ regularization, and the uniform Lipschitz constant is also replaced by actual norms of gradients along trajectory. Our bounds have no implicit dependence on dimensions, norms or other capacity measures of parameter, which elegantly characterizes the phenomenon of "Fast Training Guarantees Generalization" in non-convex settings. This is the first algorithm-dependent result with reasonable dependence on aggregated step sizes for non-convex learning, and has important implications to statistical learning aspects of stochastic gradient methods in complicated models such as deep learning.
- Ridesourcing platforms like Uber and Didi are getting more and more popular around the world. However, unauthorized ridesourcing activities taking advantages of the sharing economy can greatly impair the healthy development of this emerging industry. As the first step to regulate on-demand ride services and eliminate black market, we design a method to detect ridesourcing cars from a pool of cars based on their trajectories. Since licensed ridesourcing car traces are not openly available and may be completely missing in some cities due to legal issues, we turn to transferring knowledge from public transport open data, i.e, taxis and buses, to ridesourcing detection among ordinary vehicles. We propose a two-stage transfer learning framework. In Stage 1, we take taxi and bus data as input to learn a random forest (RF) classifier using trajectory features shared by taxis/buses and ridesourcing/other cars. Then, we use the RF to label all the candidate cars. In Stage 2, leveraging the subset of high confident labels from the previous stage as input, we further learn a convolutional neural network (CNN) classifier for ridesourcing detection, and iteratively refine RF and CNN, as well as the feature set, via a co-training process. Finally, we use the resulting ensemble of RF and CNN to identify the ridesourcing cars in the candidate pool. Experiments on real car, taxi and bus traces show that our transfer learning framework, with no need of a pre-labeled ridesourcing dataset, can achieve similar accuracy as the supervised learning methods.
- Advances in mobile computing technologies have made it possible to monitor and apply data-driven interventions across complex systems in real time. Markov decision processes (MDPs) are the primary model for sequential decision problems with a large or indefinite time horizon. Choosing a representation of the underlying decision process that is both Markov and low-dimensional is non-trivial. We propose a method for constructing a low-dimensional representation of the original decision process for which: 1. the MDP model holds; 2. a decision strategy that maximizes mean utility when applied to the low-dimensional representation also maximizes mean utility when applied to the original process. We use a deep neural network to define a class of potential process representations and estimate the process of lowest dimension within this class. The method is illustrated using data from a mobile study on heavy drinking and smoking among college students.
- We consider the phase retrieval problem of recovering the unknown signal from the magnitude-only measurements, where the measurements can be contaminated by both sparse arbitrary corruption and bounded random noise. We propose a new nonconvex algorithm for robust phase retrieval, namely Robust Wirtinger Flow, to jointly estimate the unknown signal and the sparse corruption. We show that our proposed algorithm is guaranteed to converge linearly to the unknown true signal up to a minimax optimal statistical precision in such a challenging setting. Compared with existing robust phase retrieval methods, we improved the statistical error rate by a factor of $\sqrt{n/m}$ where $n$ is the dimension of the signal and $m$ is the sample size, provided a refined characterization of the corruption fraction requirement, and relaxed the lower bound condition on the number of corruption. In the noise-free case, our algorithm converges to the unknown signal at a linear rate and achieves optimal sample complexity up to a logarithm factor. Thorough experiments on both synthetic and real datasets corroborate our theory.
- Triplet networks are widely used models that are characterized by good performance in classification and retrieval tasks. In this work we propose to train a triplet network by putting it as the discriminator in Generative Adversarial Nets (GANs). We make use of the good capability of representation learning of the discriminator to increase the predictive quality of the model. We evaluated our approach on Cifar10 and MNIST datasets and observed significant improvement on the classification performance using the simple k-nn method.
- In this paper, we consider efficient differentially private empirical risk minimization from the viewpoint of optimization algorithms. For strongly convex and smooth objectives, we prove that gradient descent with output perturbation not only achieves nearly optimal utility, but also significantly improves the running time of previous state-of-the-art private optimization algorithms, for both $\epsilon$-DP and $(\epsilon, \delta)$-DP. For non-convex but smooth objectives, we propose an RRPSGD (Random Round Private Stochastic Gradient Descent) algorithm, which provably converges to a stationary point with privacy guarantee. Besides the expected utility bounds, we also provide guarantees in high probability form. Experiments demonstrate that our algorithm consistently outperforms existing method in both utility and running time.
- Iterative Hard Thresholding (IHT) is a class of projected gradient descent methods for optimizing sparsity-constrained minimization models, with the best known efficiency and scalability in practice. As far as we know, the existing IHT-style methods are designed for sparse minimization in primal form. It remains open to explore duality theory and algorithms in such a non-convex and NP-hard problem setting. In this paper, we bridge this gap by establishing a duality theory for sparsity-constrained minimization with $\ell_2$-regularized loss function and proposing an IHT-style algorithm for dual maximization. Our sparse duality theory provides a set of sufficient and necessary conditions under which the original NP-hard/non-convex problem can be equivalently solved in a dual formulation. The proposed dual IHT algorithm is a super-gradient method for maximizing the non-smooth dual objective. An interesting finding is that the sparse recovery performance of dual IHT is invariant to the Restricted Isometry Property (RIP), which is required by virtually all the existing primal IHT algorithms without sparsity relaxation. Moreover, a stochastic variant of dual IHT is proposed for large-scale stochastic optimization. Numerical results demonstrate the superiority of dual IHT algorithms to the state-of-the-art primal IHT-style algorithms in model estimation accuracy and computational efficiency.
- Boltzmann machines are physics informed generative models with wide applications in machine learning. They can learn the probability distribution from an input dataset and generate new samples accordingly. Applying them back to physics, the Boltzmann machines are ideal recommender systems to accelerate Monte Carlo simulation of physical systems due to their flexibility and effectiveness. More intriguingly, we show that the generative sampling of the Boltzmann Machines can even discover unknown cluster Monte Carlo algorithms. The creative power comes from the latent representation of the Boltzmann machines, which learn to mediate complex interactions and identify clusters of the physical system. We demonstrate these findings with concrete examples of the classical Ising model with and without four spin plaquette interactions. Our results endorse a fresh research paradigm where intelligent machines are designed to create or inspire human discovery of innovative algorithms.
- Feb 22 2017 stat.ML arXiv:1702.06525v2We study the problem of low-rank plus sparse matrix recovery. We propose a generic and efficient nonconvex optimization algorithm based on projected gradient descent and double thresholding operator, with much lower computational complexity. Compared with existing convex-relaxation based methods, the proposed algorithm recovers the low-rank plus sparse matrices for free, without incurring any additional statistical cost. It not only enables exact recovery of the unknown low-rank and sparse matrices in the noiseless setting, and achieves minimax optimal statistical error rate in the noisy case, but also matches the best-known robustness guarantee (i.e., tolerance for sparse corruption). At the core of our theory is a novel structural Lipschitz gradient condition for low-rank plus sparse matrices, which is essential for proving the linear convergence rate of our algorithm, and we believe is of independent interest to prove fast rates for general superposition-structured models. We demonstrate the superiority of our generic algorithm, both theoretically and experimentally, through three concrete applications: robust matrix sensing, robust PCA and one-bit matrix decomposition.
- Traditionally, multi-layer neural networks use dot product between the output vector of previous layer and the incoming weight vector as the input to activation function. The result of dot product is unbounded, thus increases the risk of large variance. Large variance of neuron makes the model sensitive to the change of input distribution, thus results in poor generalization, and aggravates the internal covariate shift which slows down the training. To bound dot product and decrease the variance, we propose to use cosine similarity or centered cosine similarity (Pearson Correlation Coefficient) instead of dot product in neural networks, which we call cosine normalization. We compare cosine normalization with batch, weight and layer normalization in fully-connected neural networks as well as convolutional networks on the data sets of MNIST, 20NEWS GROUP, CIFAR-10/100 and SVHN. Experiments show that cosine normalization achieves better performance than other normalization techniques.
- In this work we study the quantitative relation between the recursive teaching dimension (RTD) and the VC dimension (VCD) of concept classes of finite sizes. The RTD of a concept class $\mathcal C \subseteq \{0, 1\}^n$, introduced by Zilles et al. (2011), is a combinatorial complexity measure characterized by the worst-case number of examples necessary to identify a concept in $\mathcal C$ according to the recursive teaching model. For any finite concept class $\mathcal C \subseteq \{0,1\}^n$ with $\mathrm{VCD}(\mathcal C)=d$, Simon & Zilles (2015) posed an open problem $\mathrm{RTD}(\mathcal C) = O(d)$, i.e., is RTD linearly upper bounded by VCD? Previously, the best known result is an exponential upper bound $\mathrm{RTD}(\mathcal C) = O(d \cdot 2^d)$, due to Chen et al. (2016). In this paper, we show a quadratic upper bound: $\mathrm{RTD}(\mathcal C) = O(d^2)$, much closer to an answer to the open problem. We also discuss the challenges in fully solving the problem.
- Identification and estimation of causal effects with confounders subject to instrumental missingnessFeb 15 2017 stat.ME arXiv:1702.03951v2Drawing causal inference from unconfounded observational studies is of great importance, which, however, is jeopardized if the confounders are subject to missingness. Generally, it is impossible to identify causal effects if the confounders are missing not at random. In this paper, we propose a novel framework to nonparametrically identify the causal effects with confounders missing not at random, but subject to instrumental missingness, that is, the missing data mechanism is independent of the outcome, given the treatment and possibly missing confounder values. The average causal effect is then estimated using a nonparametric two-stage least squares estimator based on series approximation.
- P-values are being computed for increasingly complicated statistics but lacking evaluations on their quality. Meanwhile, accurate p-values enable significance comparison across batches of hypothesis tests and consequently unified false discover rate (FDR) control. This article discusses two related questions in this setting. First, we propose statistical tests to evaluate the quality of p-value and the cross-batch comparability of any other statistic. Second, we propose a lasso based variable selection statistic, based on when the predictor variable first becomes active, and compute its p-value to achieve unified FDR control across multiple selections. In the end, we apply our tests on covTest, selectiveInference, and our statistic, based on real and null datasets for network inference in normal and high-dimensional settings. Results demonstrate higher p-value quality from our statistic and reveal p-value errors from others hidden before. We implement our statistic as lassopv in R.
- Restricted Boltzmann machine (RBM) is one of the fundamental building blocks of deep learning. RBM finds wide applications in dimensional reduction, feature extraction, and recommender systems via modeling the probability distributions of a variety of input data including natural images, speech signals, and customer ratings, etc. We build a bridge between RBM and tensor network states (TNS) widely used in quantum many-body physics research. We devise efficient algorithms to translate an RBM into the commonly used TNS. Conversely, we give sufficient and necessary conditions to determine whether a TNS can be transformed into an RBM of given architectures. Revealing these general and constructive connections can cross-fertilize both deep learning and quantum-many body physics. Notably, by exploiting the entanglement entropy bound of TNS, we can rigorously quantify the expressive power of RBM on complex datasets. Insights into TNS and its entanglement capacity can guide the design of more powerful deep learning architectures. On the other hand, RBM can represent quantum many-body states with fewer parameters compared to TNS, which may allow more efficient classical simulations.
- Jan 10 2017 stat.ML arXiv:1701.02301v2We propose a generic framework based on a new stochastic variance-reduced gradient descent algorithm for accelerating nonconvex low-rank matrix recovery. Starting from an appropriate initial estimator, our proposed algorithm performs projected gradient descent based on a novel semi-stochastic gradient specifically designed for low-rank matrix recovery. Based upon the mild restricted strong convexity and smoothness conditions, we derive a projected notion of the restricted Lipschitz continuous gradient property, and prove that our algorithm enjoys linear convergence rate to the unknown low-rank matrix with an improved computational complexity. Moreover, our algorithm can be employed to both noiseless and noisy observations, where the optimal sample complexity and the minimax optimal statistical rate can be attained respectively. We further illustrate the superiority of our generic framework through several specific examples, both theoretically and experimentally.
- Jan 03 2017 stat.ML arXiv:1701.00481v2We study the problem of estimating low-rank matrices from linear measurements (a.k.a., matrix sensing) through nonconvex optimization. We propose an efficient stochastic variance reduced gradient descent algorithm to solve a nonconvex optimization problem of matrix sensing. Our algorithm is applicable to both noisy and noiseless settings. In the case with noisy observations, we prove that our algorithm converges to the unknown low-rank matrix at a linear rate up to the minimax optimal statistical error. And in the noiseless setting, our algorithm is guaranteed to linearly converge to the unknown low-rank matrix and achieves exact recovery with optimal sample complexity. Most notably, the overall computational complexity of our proposed algorithm, which is defined as the iteration complexity times per iteration time complexity, is lower than the state-of-the-art algorithms based on gradient descent. Experiments on synthetic data corroborate the superiority of the proposed algorithm over the state-of-the-art algorithms.
- Recommender systems play an essential role in the modern business world. They recommend favorable items like books, movies, and search queries to users based on their past preferences. Applying similar ideas and techniques to Monte Carlo simulations of physical systems boosts their efficiency without sacrificing accuracy. Exploiting the quantum to classical mapping inherent in the continuous-time quantum Monte Carlo methods, we construct a classical molecular gas model to reproduce the quantum distributions. We then utilize powerful molecular simulation techniques to propose efficient quantum Monte Carlo updates. The recommender engine approach provides a general way to speed up the quantum impurity solvers.
- Dec 01 2016 stat.ME arXiv:1611.09925v3Instrumental variables (IVs) are widely used for estimating causal effects in the presence of unmeasured confounding. Under the standard IV model, however, the average treatment effect (ATE) is only partially identifiable. To address this, we propose novel assumptions that allow for identification of the ATE. Our identification assumptions are clearly separated from model assumptions needed for estimation, so that researchers are not required to commit to a specific observed data model in establishing identification. We then construct multiple estimators that are consistent under three different observed data models, and multiply robust estimators that are consistent in the union of these observed data models. We pay special attention to the case of binary outcomes, for which we obtain bounded estimators of the ATE that are guaranteed to lie between -1 and 1. Our approaches are illustrated with simulations and a data analysis evaluating the causal effect of education on earnings.
- Since about 100 years ago, to learn the intrinsic structure of data, many representation learning approaches have been proposed, including both linear ones and nonlinear ones, supervised ones and unsupervised ones. Particularly, deep architectures are widely applied for representation learning in recent years, and have delivered top results in many tasks, such as image classification, object detection and speech recognition. In this paper, we review the development of data representation learning methods. Specifically, we investigate both traditional feature learning algorithms and state-of-the-art deep learning models. The history of data representation learning is introduced, while available resources (e.g. online course, tutorial and book information) and toolboxes are provided. Finally, we conclude this paper with remarks and some interesting research directions on data representation learning.
- Probabilistic Temporal Tensor Factorization (PTTF) is an effective algorithm to model the temporal tensor data. It leverages a time constraint to capture the evolving properties of tensor data. Nowadays the exploding dataset demands a large scale PTTF analysis, and a parallel solution is critical to accommodate the trend. Whereas, the parallelization of PTTF still remains unexplored. In this paper, we propose a simple yet efficient Parallel Probabilistic Temporal Tensor Factorization, referred to as P$^2$T$^2$F, to provide a scalable PTTF solution. P$^2$T$^2$F is fundamentally disparate from existing parallel tensor factorizations by considering the probabilistic decomposition and the temporal effects of tensor data. It adopts a new tensor data split strategy to subdivide a large tensor into independent sub-tensors, the computation of which is inherently parallel. We train P$^2$T$^2$F with an efficient algorithm of stochastic Alternating Direction Method of Multipliers, and show that the convergence is guaranteed. Experiments on several real-word tensor datasets demonstrate that P$^2$T$^2$F is a highly effective and efficiently scalable algorithm dedicated for large scale probabilistic temporal tensor analysis.
- We propose a novel probabilistic dimensionality reduction framework that can naturally integrate the generative model and the locality information of data. Based on this framework, we present a new model, which is able to learn a smooth skeleton of embedding points in a low-dimensional space from high-dimensional noisy data. The formulation of the new model can be equivalently interpreted as two coupled learning problem, i.e., structure learning and the learning of projection matrix. This interpretation motivates the learning of the embedding points that can directly form an explicit graph structure. We develop a new method to learn the embedding points that form a spanning tree, which is further extended to obtain a discriminative and compact feature representation for clustering problems. Unlike traditional clustering methods, we assume that centers of clusters should be close to each other if they are connected in a learned graph, and other cluster centers should be distant. This can greatly facilitate data visualization and scientific discovery in downstream analysis. Extensive experiments are performed that demonstrate that the proposed framework is able to obtain discriminative feature representations, and correctly recover the intrinsic structures of various real-world datasets.
- Oct 18 2016 stat.ML arXiv:1610.05275v1We propose a unified framework for estimating low-rank matrices through nonconvex optimization based on gradient descent algorithm. Our framework is quite general and can be applied to both noisy and noiseless observations. In the general case with noisy observations, we show that our algorithm is guaranteed to linearly converge to the unknown low-rank matrix up to minimax optimal statistical error, provided an appropriate initial estimator. While in the generic noiseless setting, our algorithm converges to the unknown low-rank matrix at a linear rate and enables exact recovery with optimal sample complexity. In addition, we develop a new initialization algorithm to provide a desired initial estimator, which outperforms existing initialization algorithms for nonconvex low-rank matrix estimation. We illustrate the superiority of our framework through three examples: matrix regression, matrix completion, and one-bit matrix completion. We also corroborate our theory through extensive experiments on synthetic data.
- Despite their exceptional flexibility and popularity, the Monte Carlo methods often suffer from slow mixing times for challenging statistical physics problems. We present a general strategy to overcome this difficulty by adopting ideas and techniques from the machine learning community. We fit the unnormalized probability of the physical model to a feedforward neural network and reinterpret the architecture as a restricted Boltzmann machine. Then, exploiting its feature detection ability, we utilize the restricted Boltzmann machine for efficient Monte Carlo updates and to speed up the simulation of the original physical system. We implement these ideas for the Falicov-Kimball model and demonstrate improved acceptance ratio and autocorrelation time near the phase transition point.
- Sep 27 2016 stat.ME arXiv:1609.07690v1Bridge sampling is an effective Monte Carlo method for estimating the ratio of normalizing constants of two probability densities, a routine computational problem in statistics, physics, chemistry, etc. The Monte Carlo error of the bridge sampling estimator is determined by the amount of overlap between the two densities. Complementing and generalizing the Warp-I, II, and III transformations (Meng and Schilling, 2002), which are most effective for increasing the overlap between two uni-modal densities, we introduce Warp-U transformations that aim to transform multi-modal densities into uni-modal ones but without altering their normalizing constants. The construction of Warp-U transformation starts with a Gaussian (or other convenient) mixture distribution that has a reasonable overlap with a target density p underlying the unknown normalizing constant. The stochastic transformation that maps the Gaussian mixture distribution back to its generating distribution N(0,1) is then applied to p, resulting in its Warp-U transformation. The overlap between the Warp-U transformation density and N (0,1) is theoretically guaranteed to be no less than the overlap between p and the Gaussian mixture, as measured by any f-divergence, leading to statistically more efficient bridge sampling estimators. We propose a computationally efficient method to find an appropriate the Gaussian mixture, and use simulations to explore various estimation strategies and the choices of tuning parameters, with the aim to achieve statistical efficiency without unduly losing computational efficiency. We illustrate our findings using 10-50 dimensional highly irregular multi-modal densities. We also propose a strategy for using Warp-U transformations to improve MCMC algorithms, especially for sampling from multi-modal distributions.
- Jul 12 2016 stat.ME arXiv:1607.02631v3Nonmonotone missing data arise routinely in empirical studies of social and health sciences, and when ignored, can induce selection bias and loss of efficiency. In practice, it is common to account for nonresponse under a missing-at-random assumption which although convenient, is rarely appropriate when nonresponse is nonmonotone. Likelihood and Bayesian missing data methodologies often require specification of a parametric model for the full data law, thus a priori ruling out any prospect for semiparametric inference. In this paper, we propose an all-purpose approach which delivers semiparametric inferences when missing data are nonmonotone and not at random. The approach is based on a discrete choice model (DCM) as a means to generate a large class of nonmonotone nonresponse mechanisms that are nonignorable. Sufficient conditions for nonparametric identification are given, and a general framework for fully parametric and semiparametric inference under an arbitrary DCM is proposed. Special consideration is given to the case of logit discrete choice nonresponse model (LDCM) for which we describe generalizations of inverse-probability weighting, pattern-mixture estimation, doubly robust estimation and multiply robust estimation.
- Jun 06 2016 stat.ME arXiv:1606.00921v2There is increasing interest in learning how human brain networks vary as a function of a continuous trait, but flexible and efficient procedures to accomplish this goal are limited. We develop a Bayesian semiparametric model, which combines low-rank factorizations and flexible Gaussian process priors to learn changes in the conditional expectation of a network-valued random variable across the values of a continuous predictor, while including subject-specific random effects. The formulation leads to a general framework for inference on changes in brain network structures across human traits, facilitating borrowing of information and coherently characterizing uncertainty. We provide an efficient Gibbs sampler for posterior computation along with simple procedures for inference, prediction and goodness-of-fit assessments. The model is applied to learn how human brain networks vary across individuals with different intelligence scores. Results provide interesting insights on the association between intelligence and brain connectivity, while demonstrating good predictive performance.
- Jun 02 2016 cond-mat.stat-mech stat.ML arXiv:1606.00318v2Unsupervised learning is a discipline of machine learning which aims at discovering patterns in big data sets or classifying the data into several categories without being trained explicitly. We show that unsupervised learning techniques can be readily used to identify phases and phases transitions of many body systems. Starting with raw spin configurations of a prototypical Ising model, we use principal component analysis to extract relevant low dimensional representations the original data and use clustering analysis to identify distinct phases in the feature space. This approach successfully finds out physical concepts such as order parameter and structure factor to be indicators of the phase transition. We discuss future prospects of discovering more complex phases and phase transitions using unsupervised learning techniques.
- In this paper, we study the estimation of partially linear models for spatial data distributed over complex domains. We use bivariate splines over triangulations to represent the nonparametric component on an irregular two-dimensional domain. The proposed method is formulated as a constrained minimization problem which does not require constructing finite elements or locally supported basis functions. Thus, it allows an easier implementation of piecewise polynomial representations of various degrees and various smoothness over an arbitrary triangulation. Moreover, the constrained minimization problem is converted into an unconstrained minimization via a QR decomposition of the smoothness constraints, which leads to a penalized least squares method to estimate the model. The estimators of the parameters are proved to be asymptotically normal under some regularity conditions. The estimator of the bivariate function is consistent, and its rate of convergence is also established. The proposed method enables us to construct confidence intervals and permits inference for the parameters. The performance of the estimators is evaluated by two simulation examples and by a real data analysis.
- May 24 2016 stat.ME arXiv:1605.06837v1Many biological phenomena undergo developmental changes in time and space. Functional mapping, which is aimed at mapping genes that affect developmental patterns, is instrumental for studying the genetic architecture of biological changes. Often biological processes are mediated by a network of developmental and physiological components and, therefore, are better described by multiple phenotypes. In this article, we develop a multivariate model for functional mapping that can detect and characterize quantitative trait loci (QTLs) that simultaneously control multiple dynamic traits. Because the true genotypes of QTLs are unknown, the measurements for the multiple dynamic traits are modeled using a mixture distribution. The functional means of the multiple dynamic traits are estimated using the nonparametric regression method, which avoids any parametric assumption on the functional means. We propose the profile likelihood method to estimate the mixture model. A likelihood ratio test is exploited to test for the existence of pleiotropic effects on distinct but developmentally correlated traits. A simulation study is implemented to illustrate the finite sample performance of our proposed method. We also demonstrate our method by identifying QTLs that simultaneously control three dynamic traits of soybeans. The three dynamic traits are the time-course biomass of the leaf, the stem, and the root of the whole soybean. The genetic linkage map is constructed with 950 microsatellite markers. The new model can aid in our comprehension of the genetic control mechanisms of complex dynamic traits over time.
- May 18 2016 stat.ME arXiv:1605.05309v1It is common that in medical studies, the outcome of interest is truncated by death, meaning that a subject had died before the outcome could be measured. In this case, restricted analysis among survivors may be subject to selection bias. It is hence of interest to estimate the survivor average causal effect (SACE), defined as the average causal effect among subjects who would survive under either exposure. In this paper, we consider the identification and estimation problems of the SACE. We propose to use a "substitution variable" in place of the latent membership in the always-survivor group. The identifiability conditions required for a "substitution variable" are conceptually similar to conditions for an instrumental variable. We show that the SACE is identifiable with use of such a "substitution variable," and propose novel model parameterizations for estimation of the SACE under our identification assumptions. Our approaches are illustrated via simulation studies and two data analyses.
- May 13 2016 stat.ME arXiv:1605.03677v2Instrumental variables are widely used for estimating causal effects in the presence of unmeasured confounding. The discrete instrumental variable model has testable implications on the law of the observed data. However, current assessments of instrumental validity are typically based solely on subject-matter arguments rather than these testable implications, partly due to a lack of formal statistical tests with known properties. In this paper, we develop simple procedures for testing the binary instrumental variable model. Our methods are based on existing approaches for comparing two treatments, such as the t-test and the Gail--Simon test. We illustrate the importance of testing the instrumental variable model by evaluating the exogeneity of college proximity using the National Longitudinal Survey of Young Men.
- In this paper, we propose a new type of array antenna, termed the Random Frequency Diverse Array (RFDA), for an uncoupled indication of target direction and range with low system complexity. In RFDA, each array element has a narrow bandwidth and a randomly assigned carrier frequency. The beampattern of the array is shown to be stochastic but thumbtack-like, and its stochastic characteristics, such as the mean, variance, and asymptotic distribution are derived analytically. Based on these two features, we propose two kinds of algorithms for signal processing. One is matched filtering, due to the beampattern's good characteristics. The other is compressive sensing, because the new approach can be regarded as a sparse and random sampling of target information in the spatial-frequency domain. Fundamental limits, such as the Cramér-Rao bound and the observing matrix's mutual coherence, are provided as performance guarantees of the new array structure. The features and performances of RFDA are verified with numerical results.
- In this article, we propose new Bayesian methods for selecting and estimating a sparse coefficient vector for skewed heteroscedastic response. Our novel Bayesian procedures effectively estimate the median and other quantile functions, accommodate non-local prior for regression effects without compromising ease of implementation via sampling based tools, and asymptotically select the true set of predictors even when the number of covariates increases in the same order of the sample size. We also extend our method to deal with some observations with very large errors. Via simulation studies and a re-analysis of a medical cost study with large number of potential predictors, we illustrate the ease of implementation and other practical advantages of our approach compared to existing methods for such studies.
- Feb 23 2016 stat.ME arXiv:1602.06366v1The propensity score plays a central role in inferring causal effects from observational studies. In particular, weighting and subclassification are two principal approaches to estimate the average causal effect based on estimated propensity scores. Unlike the conventional version of the propensity score subclassification estimator, if the propensity score model is correctly specified, the weighting methods offer consistent and possibly efficient estimation of the average causal effect. However, this theoretical appeal may be diminished in practice by sensitivity to misspecification of the propensity score model. In contrast, subclassification methods are usually more robust to model misspecification. We hence propose to use subclassification for robust estimation of propensity score weights. Our approach is based on the intuition that the inverse probability weighting estimator can be seen as the limit of subclassification estimators as the number of subclasses goes to infinity. By formalizing this intuition, we propose novel propensity score weighting estimators that are both consistent and robust to model misspecification. Empirical studies show that the proposed estimators perform favorably compared to existing methods.
- We consider a flexible semiparametric quantile regression model for analyzing high dimensional heterogeneous data. This model has several appealing features: (1) By considering different conditional quantiles, we may obtain a more complete picture of the conditional distribution of a response variable given high dimensional covariates. (2) The sparsity level is allowed to be different at different quantile levels. (3) The partially linear additive structure accommodates nonlinearity and circumvents the curse of dimensionality. (4) It is naturally robust to heavy-tailed distributions. In this paper, we approximate the nonlinear components using B-spline basis functions. We first study estimation under this model when the nonzero components are known in advance and the number of covariates in the linear part diverges. We then investigate a nonconvex penalized estimator for simultaneous variable selection and estimation. We derive its oracle property for a general class of nonconvex penalty functions in the presence of ultra-high dimensional covariates under relaxed conditions. To tackle the challenges of nonsmooth loss function, nonconvex penalty function and the presence of nonlinear components, we combine a recently developed convex-differencing method with modern empirical process techniques. Monte Carlo simulations and an application to a microarray study demonstrate the effectiveness of the proposed method. We also discuss how the method for a single quantile of interest can be extended to simultaneous variable selection and estimation at multiple quantiles.
- Many scientific datasets are of high dimension, and the analysis usually requires visual manipulation by retaining the most important structures of data. Principal curve is a widely used approach for this purpose. However, many existing methods work only for data with structures that are not self-intersected, which is quite restrictive for real applications. A few methods can overcome the above problem, but they either require complicated human-made rules for a specific task with lack of convergence guarantee and adaption flexibility to different tasks, or cannot obtain explicit structures of data. To address these issues, we develop a new regularized principal graph learning framework that captures the local information of the underlying graph structure based on reversed graph embedding. As showcases, models that can learn a spanning tree or a weighted undirected $\ell_1$ graph are proposed, and a new learning algorithm is developed that learns a set of principal points and a graph structure from data, simultaneously. The new algorithm is simple with guaranteed convergence. We then extend the proposed framework to deal with large-scale data. Experimental results on various synthetic and six real world datasets show that the proposed method compares favorably with baselines and can uncover the underlying structure correctly.
- In this paper, we propose a new regularization technique called "functional SCAD". We then combine this technique with the smoothing spline method to develop a smooth and locally sparse (i.e., zero on some sub-regions) estimator for the coefficient function in functional linear regression. The functional SCAD has a nice shrinkage property that enables our estimating procedure to identify the null subregions of the coefficient function without over shrinking the non-zero values of the coefficient function. Additionally, the smoothness of our estimated coefficient function is regularized by a roughness penalty rather than by controlling the number of knots. Our method is more theoretically sound and is computationally simpler than the other available methods. An asymptotic analysis shows that our estimator is consistent and can identify the null region with the probability tending to one. Furthermore, simulation studies show that our estimator has superior numerical performance. Finally, the practical merit of our method is demonstrated on two real applications.
- A common problem in formulating models for the relative risk and risk difference is the variation dependence between these parameters and the baseline risk, which is a nuisance model. We address this problem by proposing the conditional log odds-product as a preferred nuisance model. This novel nuisance model facilitates maximum-likelihood estimation, but also permits doubly-robust estimation for the parameters of interest. Our approach is illustrated via simulations and a data analysis.
- Efficient index structures for fast approximate nearest neighbor queries are required in many applications such as recommendation systems. In high-dimensional spaces, many conventional methods suffer from excessive usage of memory and slow response times. We propose a method where multiple random projection trees are combined by a novel voting scheme. The key idea is to exploit the redundancy in a large number of candidate sets obtained by independently generated random projections in order to reduce the number of expensive exact distance evaluations. The method is straightforward to implement using sparse projections which leads to a reduced memory footprint and fast index construction. Furthermore, it enables grouping of the required computations into big matrix multiplications, which leads to additional savings due to cache effects and low-level parallelization. We demonstrate by extensive experiments on a wide variety of data sets that the method is faster than existing partitioning tree or hashing based approaches, making it the fastest available technique on high accuracy levels.
- Aug 14 2015 stat.ML arXiv:1508.03106v2Most existing binary classification methods target on the optimization of the overall classification risk and may fail to serve some real-world applications such as cancer diagnosis, where users are more concerned with the risk of misclassifying one specific class than the other. Neyman-Pearson (NP) paradigm was introduced in this context as a novel statistical framework for handling asymmetric type I/II error priorities. It seeks classifiers with a minimal type II error and a constrained type I error under a user specified level. This article is the first attempt to construct classifiers with guaranteed theoretical performance under the NP paradigm in high-dimensional settings. Based on the fundamental Neyman-Pearson Lemma, we used a plug-in approach to construct NP-type classifiers for Naive Bayes models. The proposed classifiers satisfy the NP oracle inequalities, which are natural NP paradigm counterparts of the oracle inequalities in classical binary classification. Besides their desirable theoretical properties, we also demonstrated their numerical advantages in prioritized error control via both simulation and real data studies.
- May 19 2015 stat.ME arXiv:1505.04493v3Comparing large covariance matrices has important applications in modern genomics, where scientists are often interested in understanding whether relationships (e.g., dependencies or co-regulations) among a large number of genes vary between different biological states. We propose a computationally fast procedure for testing the equality of two large covariance matrices when the dimensions of the covariance matrices are much larger than the sample sizes. A distinguishing feature of the new procedure is that it imposes no structural assumptions on the unknown covariance matrices. Hence the test is robust with respect to various complex dependence structures that frequently arise in genomics. We prove that the proposed procedure is asymptotically valid under weak moment conditions. As an interesting application, we derive a new gene clustering algorithm which shares the same nice property of avoiding restrictive structural assumptions for high-dimensional genomics data. Using an asthma gene expression dataset, we illustrate how the new test helps compare the covariance matrices of the genes across different gene sets/pathways between the disease group and the control group, and how the gene clustering algorithm provides new insights on the way gene clustering patterns differ between the two groups. The proposed methods have been implemented in an R-package HDtest and is available on CRAN.
- May 11 2015 stat.ME arXiv:1505.02118v3It is common that in multiarm randomized trials, the outcome of interest is "truncated by death," meaning that it is only observed or well defined conditioning on an intermediate outcome. In this case, in addition to pairwise contrasts, the joint inference for all treatment arms is also of interest. Under a monotonicity assumption we present methods for both pairwise and joint causal analyses of ordinal treatments and binary outcomes in presence of truncation by death. We illustrate via examples the appropriateness of our assumptions in different scientific contexts.
- This article is concerned with the spectral behavior of $p$-dimensional linear processes in the moderately high-dimensional case when both dimensionality $p$ and sample size $n$ tend to infinity so that $p/n\to0$. It is shown that, under an appropriate set of assumptions, the empirical spectral distributions of the renormalized and symmetrized sample autocovariance matrices converge almost surely to a nonrandom limit distribution supported on the real line. The key assumption is that the linear process is driven by a sequence of $p$-dimensional real or complex random vectors with i.i.d. entries possessing zero mean, unit variance and finite fourth moments, and that the $p\times p$ linear process coefficient matrices are Hermitian and simultaneously diagonalizable. Several relaxations of these assumptions are discussed. The results put forth in this paper can help facilitate inference on model parameters, model diagnostics and prediction of future values of the linear process.
- This paper offers a characterization of fundamental limits on the classification and reconstruction of high-dimensional signals from low-dimensional features, in the presence of side information. We consider a scenario where a decoder has access both to linear features of the signal of interest and to linear features of the side information signal; while the side information may be in a compressed form, the objective is recovery or classification of the primary signal, not the side information. The signal of interest and the side information are each assumed to have (distinct) latent discrete labels; conditioned on these two labels, the signal of interest and side information are drawn from a multivariate Gaussian distribution. With joint probabilities on the latent labels, the overall signal-(side information) representation is defined by a Gaussian mixture model. We then provide sharp sufficient and/or necessary conditions for these quantities to approach zero when the covariance matrices of the Gaussians are nearly low-rank. These conditions, which are reminiscent of the well-known Slepian-Wolf and Wyner-Ziv conditions, are a function of the number of linear features extracted from the signal of interest, the number of linear features extracted from the side information signal, and the geometry of these signals and their interplay. Moreover, on assuming that the signal of interest and the side information obey such an approximately low-rank model, we derive expansions of the reconstruction error as a function of the deviation from an exactly low-rank model; such expansions also allow identification of operational regimes where the impact of side information on signal reconstruction is most relevant. Our framework, which offers a principled mechanism to integrate side information in high-dimensional data problems, is also tested in the context of imaging applications.
- We propose generalized additive partial linear models for complex data which allow one to capture nonlinear patterns of some covariates, in the presence of linear components. The proposed method improves estimation efficiency and increases statistical power for correlated data through incorporating the correlation information. A unique feature of the proposed method is its capability of handling model selection in cases where it is difficult to specify the likelihood function. We derive the quadratic inference function-based estimators for the linear coefficients and the nonparametric functions when the dimension of covariates diverges, and establish asymptotic normality for the linear coefficient estimators and the rates of convergence for the nonparametric functions estimators for both finite and high-dimensional cases. The proposed method and theoretical development are quite challenging since the numbers of linear covariates and nonlinear components both increase as the sample size increases. We also propose a doubly penalized procedure for variable selection which can simultaneously identify nonzero linear and nonparametric components, and which has an asymptotic oracle property. Extensive Monte Carlo studies have been conducted and show that the proposed procedure works effectively even with moderate sample sizes. A pharmacokinetics study on renal cancer data is illustrated using the proposed method.
- Jan 10 2014 stat.ME arXiv:1401.2081v1A method using multiple imputation and bootstrap for dealing with missing data in mediation analysis is introduced and implemented in SAS. Through simulation studies, it is shown that the method performs well for both MCAR and MAR data without and with auxiliary variables. It is also shown that the method works equally well for MNAR data if auxiliary variables related to missingness are included. The application of the method is demonstrated through the analysis of a subset of data from the National Longitudinal Survey of Youth.
- We consider accurately answering smooth queries while preserving differential privacy. A query is said to be $K$-smooth if it is specified by a function defined on $[-1,1]^d$ whose partial derivatives up to order $K$ are all bounded. We develop an $\epsilon$-differentially private mechanism for the class of $K$-smooth queries. The major advantage of the algorithm is that it outputs a synthetic database. In real applications, a synthetic database output is appealing. Our mechanism achieves an accuracy of $O (n^{-\frac{K}{2d+K}}/\epsilon )$, and runs in polynomial time. We also generalize the mechanism to preserve $(\epsilon, \delta)$-differential privacy with slightly improved accuracy. Extensive experiments on benchmark datasets demonstrate that the mechanisms have good accuracy and are efficient.
- In this paper we present two new approaches to efficiently solve large-scale compressed sensing problems. These two ideas are independent of each other and can therefore be used either separately or together. We consider all possibilities. For the first approach, we note that the zero vector can be taken as the initial basic (infeasible) solution for the linear programming problem and therefore, if the true signal is very sparse, some variants of the simplex method can be expected to take only a small number of pivots to arrive at a solution. We implemented one such variant and demonstrate a dramatic improvement in computation time on very sparse signals. The second approach requires a redesigned sensing mechanism in which the vector signal is stacked into a matrix. This allows us to exploit the Kronecker compressed sensing (KCS) mechanism. We show that the Kronecker sensing requires stronger conditions for perfect recovery compared to the original vector problem. However, the Kronecker sensing, modeled correctly, is a much sparser linear optimization problem. Hence, algorithms that benefit from sparse problem representation, such as interior-point methods, can solve the Kronecker sensing problems much faster than the corresponding vector problem. In our numerical studies, we demonstrate a ten-fold improvement in the computation time.
- Nov 28 2013 stat.ME arXiv:1311.6844v1Due to measurement noise, a common problem in in various fields is how to estimate the ratio of two functions. We consider this problem of estimating the ratio of two functions in a nonparametric regression model. Assuming the noise is normally distributed, this is equivalent to estimating the ratio of the means of two normally distributed random variables. We identified a consistent estimator that gives the mean squared loss of order $O(1/n)$ ($n$ is the sample size) when conditioned on a highly probable event. We also present our result applied to both the real data from EAPS and on simulated data, confirming our theoretical results.
- We investigate high-dimensional nonconvex penalized regression, where the number of covariates may grow at an exponential rate. Although recent asymptotic theory established that there exists a local minimum possessing the oracle property under general conditions, it is still largely an open problem how to identify the oracle estimator among potentially multiple local minima. There are two main obstacles: (1) due to the presence of multiple minima, the solution path is nonunique and is not guaranteed to contain the oracle estimator; (2) even if a solution path is known to contain the oracle estimator, the optimal tuning parameter depends on many unknown factors and is hard to estimate. To address these two challenging issues, we first prove that an easy-to-calculate calibrated CCCP algorithm produces a consistent solution path which contains the oracle estimator with probability approaching one. Furthermore, we propose a high-dimensional BIC criterion and show that it can be applied to the solution path to select the optimal tuning parameter which asymptotically identifies the oracle estimator. The theory for a general class of nonconvex penalties in the ultra-high dimensional setup is established when the random errors follow the sub-Gaussian distribution. Monte Carlo studies confirm that the calibrated CCCP algorithm combined with the proposed high-dimensional BIC has desirable performance in identifying the underlying sparsity pattern for high-dimensional data analysis.
- In this note the range-renewal structure of general independent and identically distributed samplings is discussed, which is a natural extension of the result in Chen et al (arXiv:1305.1829).
- Many problems of practical interest rely on Continuous-time Markov chains~(CTMCs) defined over combinatorial state spaces, rendering the computation of transition probabilities, and hence probabilistic inference, difficult or impossible with existing methods. For problems with countably infinite states, where classical methods such as matrix exponentiation are not applicable, the main alternative has been particle Markov chain Monte Carlo methods imputing both the holding times and sequences of visited states. We propose a particle-based Monte Carlo approach where the holding times are marginalized analytically. We demonstrate that in a range of realistic inferential setups, our scheme dramatically reduces the variance of the Monte Carlo approximation and yields more accurate parameter posterior approximations given a fixed computational budget. These experiments are performed on both synthetic and real datasets, drawing from two important examples of CTMCs having combinatorial state spaces: string-valued mutation models in phylogenetics and nucleic acid folding pathways.
- We are concerned with the behavior of the eigenvalues of renormalized sample covariance matrices of the form C_n=\sqrt\fracnp\left(\frac1nA_p^1/2X_nB_nX_n^*A_p^1/2-\frac1n\tr(B_n)A_p\right) as $p,n\to \infty$ and $p/n\to 0$, where $X_{n}$ is a $p\times n$ matrix with i.i.d. real or complex valued entries $X_{ij}$ satisfying $E(X_{ij})=0$, $E|X_{ij}|^2=1$ and having finite fourth moment. $A_{p}^{1/2}$ is a square-root of the nonnegative definite Hermitian matrix $A_{p}$, and $B_{n}$ is an $n\times n$ nonnegative definite Hermitian matrix. We show that the empirical spectral distribution (ESD) of $C_n$ converges a.s. to a nonrandom limiting distribution under some assumptions. The probability density function of the LSD of $C_{n}$ is derived and it is shown that it depends on the LSD of $A_{p}$ and the limiting value of $n^{-1}\tr(B_{n}^2)$. We propose a computational algorithm for evaluating this limiting density when the LSD of $A_{p}$ is a mixture of point masses. In addition, when the entries of $X_{n}$ are sub-Gaussian, we derive the limiting empirical distribution of $\{\sqrt{n/p}(\lambda_j(S_n) - n^{-1}\tr(B_n) \lambda_j(A_{p}))\}_{j=1}^p$ where $S_n := n^{-1} A_{p}^{1/2}X_{n}B_{n}X_{n}^{*}A_{p}^{1/2}$ is the sample covariance matrix and $\lambda_j$ denotes the $j$-th largest eigenvalue, when $F^A$ is a finite mixture of point masses. These results are utilized to propose a test for the covariance structure of the data where the null hypothesis is that the joint covariance matrix is of the form $A_{p} \otimes B_n$ for $\otimes$ denoting the Kronecker product, as well as $A_{p}$ and the first two spectral moments of $B_n$ are specified. The performance of this test is illustrated through a simulation study.
- May 13 2013 stat.ML arXiv:1305.2238v2We propose a calibrated multivariate regression method named CMR for fitting high dimensional multivariate regression models. Compared with existing methods, CMR calibrates regularization for each regression task with respect to its noise level so that it simultaneously attains improved finite-sample performance and tuning insensitiveness. Theoretically, we provide sufficient conditions under which CMR achieves the optimal rate of convergence in parameter estimation. Computationally, we propose an efficient smoothed proximal gradient algorithm with a worst-case numerical rate of convergence $\cO(1/\epsilon)$, where $\epsilon$ is a pre-specified accuracy of the objective function value. We conduct thorough numerical simulations to illustrate that CMR consistently outperforms other high dimensional multivariate regression methods. We also apply CMR to solve a brain activity prediction problem and find that it is as competitive as a handcrafted model created by human experts. The R package \textttcamel implementing the proposed method is available on the Comprehensive R Archive Network \urlhttp://cran.r-project.org/web/packages/camel/.
- A central problem in ranking is to design a ranking measure for evaluation of ranking functions. In this paper we study, from a theoretical perspective, the widely used Normalized Discounted Cumulative Gain (NDCG)-type ranking measures. Although there are extensive empirical studies of NDCG, little is known about its theoretical properties. We first show that, whatever the ranking function is, the standard NDCG which adopts a logarithmic discount, converges to 1 as the number of items to rank goes to infinity. On the first sight, this result is very surprising. It seems to imply that NDCG cannot differentiate good and bad ranking functions, contradicting to the empirical success of NDCG in many applications. In order to have a deeper understanding of ranking measures in general, we propose a notion referred to as consistent distinguishability. This notion captures the intuition that a ranking measure should have such a property: For every pair of substantially different ranking functions, the ranking measure can decide which one is better in a consistent manner on almost all datasets. We show that NDCG with logarithmic discount has consistent distinguishability although it converges to the same limit for all ranking functions. We next characterize the set of all feasible discount functions for NDCG according to the concept of consistent distinguishability. Specifically we show that whether NDCG has consistent distinguishability depends on how fast the discount decays, and 1/r is a critical point. We then turn to the cut-off version of NDCG, i.e., NDCG@k. We analyze the distinguishability of NDCG@k for various choices of k and the discount functions. Experimental results on real Web search datasets agree well with the theory.
- We introduce a quantile-adaptive framework for nonlinear variable screening with high-dimensional heterogeneous data. This framework has two distinctive features: (1) it allows the set of active variables to vary across quantiles, thus making it more flexible to accommodate heterogeneity; (2) it is model-free and avoids the difficult task of specifying the form of a statistical model in a high dimensional space. Our nonlinear independence screening procedure employs spline approximations to model the marginal effects at a quantile level of interest. Under appropriate conditions on the quantile functions without requiring the existence of any moments, the new procedure is shown to enjoy the sure screening property in ultra-high dimensions. Furthermore, the quantile-adaptive framework can naturally handle censored data arising in survival analysis. We prove that the sure screening property remains valid when the response variable is subject to random right censoring. Numerical studies confirm the fine performance of the proposed method for various semiparametric models and its effectiveness to extract quantile-specific information from heteroscedastic data.
- Recently, $l_{2,1}$ matrix norm has been widely applied to many areas such as computer vision, pattern recognition, biological study and etc. As an extension of $l_1$ vector norm, the mixed $l_{2,1}$ matrix norm is often used to find jointly sparse solutions. Moreover, an efficient iterative algorithm has been designed to solve $l_{2,1}$-norm involved minimizations. Actually, computational studies have showed that $l_p$-regularization ($0<p<1$) is sparser than $l_1$-regularization, but the extension to matrix norm has been seldom considered. This paper presents a definition of mixed $l_{2,p}$ $(p\in (0, 1])$ matrix pseudo norm which is thought as both generalizations of $l_p$ vector norm to matrix and $l_{2,1}$-norm to nonconvex cases $(0<p<1)$. Fortunately, an efficient unified algorithm is proposed to solve the induced $l_{2,p}$-norm $(p\in (0, 1])$ optimization problems. The convergence can also be uniformly demonstrated for all $p\in (0, 1]$. Typical $p\in (0,1]$ are applied to select features in computational biology and the experimental results show that some choices of $0<p<1$ do improve the sparse pattern of using $p=1$.
- Matching Pursuit LASSIn Part I \citeTanPMLPart1, a Matching Pursuit LASSO (MPL) algorithm has been presented for solving large-scale sparse recovery (SR) problems. In this paper, we present a subspace search to further improve the performance of MPL, and then continue to address another major challenge of SR -- batch SR with many signals, a consideration which is absent from most of previous $\ell_1$-norm methods. As a result, a batch-mode MPL is developed to vastly speed up sparse recovery of many signals simultaneously. Comprehensive numerical experiments on compressive sensing and face recognition tasks demonstrate the superior performance of MPL and BMPL over other methods considered in this paper, in terms of sparse recovery ability and efficiency. In particular, BMPL is up to 400 times faster than existing $\ell_1$-norm methods considered to be state-of-the-art.O Part II: Applications and Sparse Recovery over Batch Signals
- We consider the problem of simultaneous variable selection and estimation in additive, partially linear models for longitudinal/clustered data. We propose an estimation procedure via polynomial splines to estimate the nonparametric components and apply proper penalty functions to achieve sparsity in the linear part. Under reasonable conditions, we obtain the asymptotic normality of the estimators for the linear components and the consistency of the estimators for the nonparametric components. We further demonstrate that, with proper choice of the regularization parameter, the penalized estimators of the non-zero coefficients achieve the asymptotic oracle property. The finite sample behavior of the penalized estimators is evaluated with simulation studies and illustrated by a longitudinal CD4 cell count data set.