results for au:Gu_Q in:stat

- We present a unified framework to analyze the global convergence of Langevin dynamics based algorithms for nonconvex finite-sum optimization with $n$ component functions. At the core of our analysis is a new decomposition scheme of the optimization error, under which we directly analyze the ergodicity of the numerical approximations of Langevin dynamics and prove sharp convergence rates. We establish the first global convergence guarantee of gradient Langevin dynamics (GLD) with iteration complexity $O\big(1/\epsilon \cdot\log(1/\epsilon)\big)$. In addition, we improve the convergence rate of stochastic gradient Langevin dynamics (SGLD) to the "almost minimizer", which does not depend on the undesirable uniform spectral gap introduced in previous studies. Furthermore, we prove, for the first time, the global convergence guarantee of variance reduced stochastic gradient Langevin dynamics (VR-SGLD) with iteration complexity $O\big(m/(B\epsilon^3)\cdot\log(1/\epsilon)\big)$, where $B$ is the mini-batch size and $m$ is the length of the inner loop. We show that the gradient complexity of VR-SGLD is $O\big(n^{1/2}/\epsilon^{3/2}\cdot\log(1/\epsilon)\big)$, which outperforms the $O\big(n/\epsilon\cdot\log(1/\epsilon)\big)$ gradient complexity of GLD when the number of component functions satisfies $n \geq 1/\epsilon$. Our theoretical analysis sheds some light on using Langevin dynamics based algorithms for nonconvex optimization with provable guarantees.
- High dimensional superposition models characterize observations using parameters which can be written as a sum of multiple component parameters, each with its own structure, e.g., sum of low rank and sparse matrices, sum of sparse and rotated sparse vectors, etc. In this paper, we consider general superposition models which allow sum of any number of component parameters, and each component structure can be characterized by any norm. We present a simple estimator for such models, give a geometric condition under which the components can be accurately estimated, characterize sample complexity of the estimator, and give high probability non-asymptotic bounds on the componentwise estimation error. We use tools from empirical processes and generic chaining for the statistical analysis, and our results, which substantially generalize prior work on superposition models, are in terms of Gaussian widths of suitable sets.
- We consider the phase retrieval problem of recovering the unknown signal from the magnitude-only measurements, where the measurements can be contaminated by both sparse arbitrary corruption and bounded random noise. We propose a new nonconvex algorithm for robust phase retrieval, namely Robust Wirtinger Flow, to jointly estimate the unknown signal and the sparse corruption. We show that our proposed algorithm is guaranteed to converge linearly to the unknown true signal up to a minimax optimal statistical precision in such a challenging setting. Compared with existing robust phase retrieval methods, we improved the statistical error rate by a factor of $\sqrt{n/m}$ where $n$ is the dimension of the signal and $m$ is the sample size, provided a refined characterization of the corruption fraction requirement, and relaxed the lower bound condition on the number of corruptions. In the noise-free case, our algorithm converges to the unknown signal at a linear rate and achieves optimal sample complexity up to a logarithmic factor. Thorough experiments on both synthetic and real datasets corroborate our theory.
- We study the estimation of the latent variable Gaussian graphical model (LVGGM), where the precision matrix is the superposition of a sparse matrix and a low-rank matrix. In order to speed up the estimation of the sparse plus low-rank components, we propose a sparsity constrained maximum likelihood estimator based on matrix factorization, and an efficient alternating gradient descent algorithm with hard thresholding to solve it. Our algorithm is orders of magnitude faster than the convex relaxation based methods for LVGGM. In addition, we prove that our algorithm is guaranteed to linearly converge to the unknown sparse and low-rank components up to the optimal statistical precision. Experiments on both synthetic and genomic data demonstrate the superiority of our algorithm over the state-of-the-art algorithms and corroborate our theory.
- Feb 22 2017 stat.ML arXiv:1702.06525v2: We study the problem of low-rank plus sparse matrix recovery. We propose a generic and efficient nonconvex optimization algorithm based on projected gradient descent and double thresholding operator, with much lower computational complexity. Compared with existing convex-relaxation based methods, the proposed algorithm recovers the low-rank plus sparse matrices for free, without incurring any additional statistical cost. It not only enables exact recovery of the unknown low-rank and sparse matrices in the noiseless setting, and achieves minimax optimal statistical error rate in the noisy case, but also matches the best-known robustness guarantee (i.e., tolerance for sparse corruption). At the core of our theory is a novel structural Lipschitz gradient condition for low-rank plus sparse matrices, which is essential for proving the linear convergence rate of our algorithm, and we believe is of independent interest to prove fast rates for general superposition-structured models. We demonstrate the superiority of our generic algorithm, both theoretically and experimentally, through three concrete applications: robust matrix sensing, robust PCA and one-bit matrix decomposition.
- Jan 10 2017 stat.ML arXiv:1701.02301v2: We propose a generic framework based on a new stochastic variance-reduced gradient descent algorithm for accelerating nonconvex low-rank matrix recovery. Starting from an appropriate initial estimator, our proposed algorithm performs projected gradient descent based on a novel semi-stochastic gradient specifically designed for low-rank matrix recovery. Based upon the mild restricted strong convexity and smoothness conditions, we derive a projected notion of the restricted Lipschitz continuous gradient property, and prove that our algorithm enjoys linear convergence rate to the unknown low-rank matrix with an improved computational complexity. Moreover, our algorithm can be applied to both noiseless and noisy observations, where the optimal sample complexity and the minimax optimal statistical rate can be attained respectively. We further illustrate the superiority of our generic framework through several specific examples, both theoretically and experimentally.
- Jan 03 2017 stat.ML arXiv:1701.00481v2: We study the problem of estimating low-rank matrices from linear measurements (a.k.a., matrix sensing) through nonconvex optimization. We propose an efficient stochastic variance reduced gradient descent algorithm to solve a nonconvex optimization problem of matrix sensing. Our algorithm is applicable to both noisy and noiseless settings. In the case with noisy observations, we prove that our algorithm converges to the unknown low-rank matrix at a linear rate up to the minimax optimal statistical error. And in the noiseless setting, our algorithm is guaranteed to linearly converge to the unknown low-rank matrix and achieves exact recovery with optimal sample complexity. Most notably, the overall computational complexity of our proposed algorithm, which is defined as the iteration complexity times per iteration time complexity, is lower than the state-of-the-art algorithms based on gradient descent. Experiments on synthetic data corroborate the superiority of the proposed algorithm over the state-of-the-art algorithms.
- Dec 30 2016 stat.ML arXiv:1612.09297v1: We propose communication-efficient distributed estimation and inference methods for the transelliptical graphical model, a semiparametric extension of the elliptical distribution in the high dimensional regime. In detail, the proposed method distributes the $d$-dimensional data of size $N$ generated from a transelliptical graphical model into $m$ worker machines, and estimates the latent precision matrix on each worker machine based on the data of size $n=N/m$. It then debiases the local estimators on the worker machines and sends them back to the master machine. Finally, on the master machine, it aggregates the debiased local estimators by averaging and hard thresholding. We show that the aggregated estimator attains the same statistical rate as the centralized estimator based on all the data, provided that the number of machines satisfies $m \lesssim \min\{N\log d/d,\sqrt{N/(s^2\log d)}\}$, where $s$ is the maximum number of nonzero entries in each column of the latent precision matrix. It is worth noting that our algorithm and theory can be directly applied to Gaussian graphical models, Gaussian copula graphical models and elliptical graphical models, since they are all special cases of transelliptical graphical models. Thorough experiments on synthetic data back up our theory.
- Oct 18 2016 stat.ML arXiv:1610.05275v1: We propose a unified framework for estimating low-rank matrices through nonconvex optimization based on gradient descent algorithm. Our framework is quite general and can be applied to both noisy and noiseless observations. In the general case with noisy observations, we show that our algorithm is guaranteed to linearly converge to the unknown low-rank matrix up to minimax optimal statistical error, provided an appropriate initial estimator. While in the generic noiseless setting, our algorithm converges to the unknown low-rank matrix at a linear rate and enables exact recovery with optimal sample complexity. In addition, we develop a new initialization algorithm to provide a desired initial estimator, which outperforms existing initialization algorithms for nonconvex low-rank matrix estimation. We illustrate the superiority of our framework through three examples: matrix regression, matrix completion, and one-bit matrix completion. We also corroborate our theory through extensive experiments on synthetic data.
- Oct 18 2016 stat.ML arXiv:1610.04798v1: We propose a communication-efficient distributed estimation method for sparse linear discriminant analysis (LDA) in the high dimensional regime. Our method distributes the data of size $N$ into $m$ machines, and estimates a local sparse LDA estimator on each machine using the data subset of size $N/m$. After the distributed estimation, our method aggregates the debiased local estimators from $m$ machines, and sparsifies the aggregated estimator. We show that the aggregated estimator attains the same statistical rate as the centralized estimation method, as long as the number of machines $m$ is chosen appropriately. Moreover, we prove that our method can attain the model selection consistency under a milder condition than the centralized method. Experiments on both synthetic and real datasets corroborate our theory.
- Jun 03 2016 stat.ML arXiv:1606.00832v1: We propose a nonconvex estimator for joint multivariate regression and precision matrix estimation in the high dimensional regime, under sparsity constraints. A gradient descent algorithm with hard thresholding is developed to solve the nonconvex estimator, and it attains a linear rate of convergence to the true regression coefficients and precision matrix simultaneously, up to the statistical error. Compared with existing methods along this line of research, which have little theoretical guarantee, the proposed algorithm not only is computationally much more efficient with provable convergence guarantee, but also attains the optimal finite sample statistical rate up to a logarithmic factor. Thorough experiments on both synthetic and real datasets back up our theory.
- Dec 31 2015 stat.ML arXiv:1512.08861v1: We study the fundamental tradeoffs between computational tractability and statistical accuracy for a general family of hypothesis testing problems with combinatorial structures. Based upon an oracle model of computation, which captures the interactions between algorithms and data, we establish a general lower bound that explicitly connects the minimum testing risk under computational budget constraints with the intrinsic probabilistic and combinatorial structures of statistical problems. This lower bound mirrors the classical statistical lower bound by Le Cam (1986) and allows us to quantify the optimal statistical performance achievable given limited computational budgets in a systematic fashion. Under this unified framework, we sharply characterize the statistical-computational phase transition for two testing problems, namely, normal mean detection and sparse principal component detection. For normal mean detection, we consider two combinatorial structures, namely, sparse set and perfect matching. For these problems we identify significant gaps between the optimal statistical accuracy that is achievable under computational tractability constraints and the classical statistical lower bounds. Compared with existing works on computational lower bounds for statistical problems, which consider general polynomial-time algorithms on Turing machines, and rely on computational hardness hypotheses on problems like planted clique detection, we focus on the oracle computational model, which covers a broad range of popular algorithms, and do not rely on unproven hypotheses. Moreover, our result provides an intuitive and concrete interpretation for the intrinsic computational intractability of high-dimensional statistical problems. One byproduct of our result is a lower bound for a strict generalization of the matrix permanent problem, which is of independent interest.
- May 19 2015 stat.ML arXiv:1505.04780v2: We present a unified framework for low-rank matrix estimation with nonconvex penalties. We first prove that the proposed estimator attains a faster statistical rate than the traditional low-rank matrix estimator with nuclear norm penalty. Moreover, we rigorously show that under a certain condition on the magnitude of the nonzero singular values, the proposed estimator enjoys oracle property (i.e., exactly recovers the true rank of the matrix), besides attaining a faster rate. As far as we know, this is the first work that establishes the theory of low-rank matrix estimation with nonconvex penalties, confirming the advantages of nonconvex penalties for matrix completion. Numerical experiments on both synthetic and real world datasets corroborate our theory.
- Mar 05 2015 stat.ML arXiv:1503.01442v2: Many high dimensional sparse learning problems are formulated as nonconvex optimization. A popular approach to solve these nonconvex optimization problems is through convex relaxations such as linear and semidefinite programming. In this paper, we study the statistical limits of convex relaxations. Particularly, we consider two problems: Mean estimation for sparse principal submatrix and edge probability estimation for stochastic block model. We exploit the sum-of-squares relaxation hierarchy to sharply characterize the limits of a broad class of convex relaxations. Our result shows statistical optimality needs to be compromised for achieving computational tractability using convex relaxations. Compared with existing results on computational lower bounds for statistical problems, which consider general polynomial-time algorithms and rely on computational hardness hypotheses on problems like planted clique detection, our theory focuses on a broad class of convex relaxations and does not rely on unproven hypotheses.
- Feb 10 2015 stat.ML arXiv:1502.02347v2: This paper proposes a unified framework to quantify local and global inferential uncertainty for high dimensional nonparanormal graphical models. In particular, we consider the problems of testing the presence of a single edge and constructing a uniform confidence subgraph. Due to the presence of unknown marginal transformations, we propose a pseudo likelihood based inferential approach. In sharp contrast to the existing high dimensional score test method, our method is free of tuning parameters given an initial estimator, and extends the scope of the existing likelihood based inferential framework. Furthermore, we propose a U-statistic multiplier bootstrap method to construct the confidence subgraph. We show that the constructed subgraph is contained in the true graph with probability greater than a given nominal level. Compared with existing methods for constructing confidence subgraphs, our method does not rely on Gaussian or sub-Gaussian assumptions. The theoretical properties of the proposed inferential methods are verified by thorough numerical experiments and real data analysis.
- Dec 31 2014 stat.ML arXiv:1412.8729v2: We provide a general theory of the expectation-maximization (EM) algorithm for inferring high dimensional latent variable models. In particular, we make two contributions: (i) For parameter estimation, we propose a novel high dimensional EM algorithm which naturally incorporates sparsity structure into parameter estimation. With an appropriate initialization, this algorithm converges at a geometric rate and attains an estimator with the (near-)optimal statistical rate of convergence. (ii) Based on the obtained estimator, we propose new inferential procedures for testing hypotheses and constructing confidence intervals for low dimensional components of high dimensional parameters. For a broad family of statistical models, our framework establishes the first computationally feasible approach for optimal estimation and asymptotic inference in high dimensions. Our theory is supported by thorough numerical results.
- Fisher score is one of the most widely used supervised feature selection methods. However, it selects each feature independently according to its score under the Fisher criterion, which leads to a suboptimal subset of features. In this paper, we present a generalized Fisher score to jointly select features. It aims at finding a subset of features that maximizes the lower bound of the traditional Fisher score. The resulting feature selection problem is a mixed integer program, which can be reformulated as a quadratically constrained linear program (QCLP). It is solved by a cutting plane algorithm, in each iteration of which a multiple kernel learning problem is solved alternately by multivariate ridge regression and projected gradient descent. Experiments on benchmark data sets indicate that the proposed method outperforms Fisher score as well as many other state-of-the-art feature selection methods.
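
For orientation on the last entry, the following is a minimal sketch of the traditional per-feature Fisher score that the abstract uses as its baseline. It is not the generalized Fisher score proposed in that paper. It assumes the common textbook form of the criterion (class-size-weighted between-class scatter divided by class-size-weighted within-class variance), and the function names `fisher_scores` and `top_k_features` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def fisher_scores(X, y):
    """Score each feature independently (assumed textbook Fisher criterion).

    X is a list of equal-length feature rows, y the matching class labels.
    """
    n, d = len(X), len(X[0])
    # Group the rows by class label.
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    overall_mean = [sum(row[j] for row in X) / n for j in range(d)]

    scores = []
    for j in range(d):
        between = 0.0  # class-size-weighted spread of class means
        within = 0.0   # class-size-weighted variance inside each class
        for rows in groups.values():
            n_k = len(rows)
            mean_k = sum(r[j] for r in rows) / n_k
            var_k = sum((r[j] - mean_k) ** 2 for r in rows) / n_k
            between += n_k * (mean_k - overall_mean[j]) ** 2
            within += n_k * var_k
        if within > 0:
            scores.append(between / within)
        elif between > 0:
            scores.append(float("inf"))  # perfectly separated, zero variance
        else:
            scores.append(0.0)  # constant feature carries no information
    return scores


def top_k_features(X, y, k):
    """Indices of the k highest-scoring features, each ranked on its own."""
    scores = fisher_scores(X, y)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Toy data: feature 0 separates the two classes, feature 1 does not.
X = [[0.0, 5.0], [1.0, 4.0], [10.0, 5.0], [11.0, 4.0]]
y = [0, 0, 1, 1]
scores = fisher_scores(X, y)
best = top_k_features(X, y, 1)
```

Because `top_k_features` ranks every feature in isolation, two highly redundant features can both be selected, which is the limitation the abstract attributes to the traditional score.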