results for au:Zhang_X in:stat

- We consider the phase retrieval problem of recovering an unknown signal from magnitude-only measurements, where the measurements can be contaminated by both sparse arbitrary corruption and bounded random noise. We propose a new nonconvex algorithm for robust phase retrieval, namely Robust Wirtinger Flow, to jointly estimate the unknown signal and the sparse corruption. We show that our proposed algorithm is guaranteed to converge linearly to the true signal up to a minimax optimal statistical precision in this challenging setting. Compared with existing robust phase retrieval methods, we improve the statistical error rate by a factor of $\sqrt{n/m}$, where $n$ is the dimension of the signal and $m$ is the sample size, provide a refined characterization of the corruption fraction requirement, and relax the lower bound condition on the number of corruptions. In the noise-free case, our algorithm converges to the unknown signal at a linear rate and achieves optimal sample complexity up to a logarithmic factor. Thorough experiments on both synthetic and real datasets corroborate our theory.
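The joint estimation above can be caricatured as alternating between a hard-thresholded corruption estimate and a Wirtinger-flow-style gradient step on the signal. The sketch below is illustrative only, not the paper's exact Robust Wirtinger Flow; the step size, the sparsity level `k`, and the near-truth initialization are all assumptions.

```python
import numpy as np

def robust_pr_step(x, s, A, y, step, k):
    """One alternating update (illustrative sketch): re-estimate the k-sparse
    corruption by keeping the k largest residuals, then take a gradient step
    on the signal for the amplitude-squared loss."""
    r = y - (A @ x) ** 2                  # residual ignoring the corruption
    s_new = np.zeros_like(s)
    top = np.argsort(np.abs(r))[-k:]      # largest residuals -> corruption support
    s_new[top] = r[top]
    z = A @ x
    # gradient of (1/2m) * sum((a_i^T x)^2 - (y_i - s_i))^2
    grad = A.T @ ((z ** 2 - (y - s_new)) * z) / len(y)
    return x - step * grad, s_new

# toy instance: corrupt a few magnitude measurements, start near the truth
rng = np.random.default_rng(0)
n, m, k = 10, 400, 5
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
x_true /= np.linalg.norm(x_true)
y = (A @ x_true) ** 2
y[:k] += 10.0                             # sparse corruption
x = x_true + 0.1 * rng.standard_normal(n) # stand-in for a spectral initializer
s = np.zeros(m)
err0 = min(np.linalg.norm(x - x_true), np.linalg.norm(x + x_true))
for _ in range(200):
    x, s = robust_pr_step(x, s, A, y, step=0.02, k=k)
err = min(np.linalg.norm(x - x_true), np.linalg.norm(x + x_true))
```

The `min` over `±x_true` accounts for the global sign ambiguity inherent to phase retrieval.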
- We consider multivariate regression with manifold-valued output; that is, for a multivariate observation, the output response lies on a manifold. Moreover, we propose a new regression model to deal with the presence of grossly corrupted manifold-valued responses, a bottleneck issue commonly encountered in practical scenarios. Our model first takes a correction step on the grossly corrupted responses via geodesic curves on the manifold, and then performs multivariate linear regression on the corrected data. This results in a nonconvex and nonsmooth optimization problem on manifolds. To this end, we propose a dedicated approach named PALMR, which utilizes and extends proximal alternating linearized minimization techniques. Theoretically, we investigate its convergence property and show that it converges to a critical point under mild conditions. Empirically, we test our model on both synthetic and real diffusion tensor imaging data, and show that it outperforms other multivariate regression models when manifold-valued responses contain gross errors, and that it is effective in identifying those errors.
- Feb 22 2017 stat.ML arXiv:1702.06525v2: We study the problem of low-rank plus sparse matrix recovery. We propose a generic and efficient nonconvex optimization algorithm based on projected gradient descent and a double thresholding operator, with much lower computational complexity. Compared with existing convex-relaxation based methods, the proposed algorithm recovers the low-rank plus sparse matrices for free, without incurring any additional statistical cost. It not only enables exact recovery of the unknown low-rank and sparse matrices in the noiseless setting and achieves the minimax optimal statistical error rate in the noisy case, but also matches the best-known robustness guarantee (i.e., tolerance for sparse corruption). At the core of our theory is a novel structural Lipschitz gradient condition for low-rank plus sparse matrices, which is essential for proving the linear convergence rate of our algorithm, and which we believe is of independent interest for proving fast rates for general superposition-structured models. We demonstrate the superiority of our generic algorithm, both theoretically and experimentally, through three concrete applications: robust matrix sensing, robust PCA, and one-bit matrix decomposition.
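The double thresholding idea can be sketched as alternating a hard sparsity threshold (keep the largest-magnitude entries) with a truncated-SVD rank projection. This is a simplified caricature of that operator pair, not the paper's algorithm; the update ordering and iteration count are assumptions.

```python
import numpy as np

def rank_proj(M, r):
    """Project onto rank-r matrices via truncated SVD."""
    U, sv, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * sv[:r]) @ Vt[:r]

def sparse_thresh(M, s):
    """Hard thresholding: keep only the s largest-magnitude entries."""
    out = np.zeros_like(M)
    keep = np.argsort(np.abs(M), axis=None)[-s:]
    out.flat[keep] = M.flat[keep]
    return out

def decompose(M, r, s, iters=30):
    """Alternate the two thresholding operators on an observed M = L + S."""
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    for _ in range(iters):
        S = sparse_thresh(M - L, s)   # sparse part from the residual
        L = rank_proj(M - S, r)       # low-rank part from the rest
    return L, S

# easy synthetic instance: incoherent rank-2 matrix plus 10 large spikes
rng = np.random.default_rng(1)
L_true = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 20))
S_true = np.zeros((20, 20))
idx = rng.choice(400, size=10, replace=False)
S_true.flat[idx] = 20.0 * rng.choice([-1, 1], size=10)
L_hat, S_hat = decompose(L_true + S_true, r=2, s=10)
```

On this well-separated instance the spike support is identified on the first pass and the alternation then refines both parts.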
- Feb 20 2017 stat.ME arXiv:1702.05195v1: This paper studies sparse normal mean models under the empirical Bayes framework. We focus on mixture priors with an atom at zero and a density component centered at a data-driven location determined by maximizing the marginal likelihood or minimizing the Stein Unbiased Risk Estimate. We study the properties of the corresponding posterior median and posterior mean. In particular, the posterior median is a thresholding rule and enjoys a multi-direction shrinkage property that shrinks the observation toward either the origin or the data-driven location. The idea is extended by considering a finite mixture prior, which is flexible enough to model the cluster structure of the unknown means. We further generalize the results to heteroscedastic normal mean models. Specifically, we propose a semiparametric estimator that can be computed efficiently by combining the familiar EM algorithm with the Pool-Adjacent-Violators algorithm for isotonic regression. The effectiveness of our methods is demonstrated via extensive numerical studies.
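The Pool-Adjacent-Violators step mentioned above is a standard building block; a minimal unweighted textbook sketch (the paper combines it with EM for the heteroscedastic case, which is omitted here):

```python
def pava(y):
    """Pool-Adjacent-Violators: least-squares isotonic (non-decreasing) fit.
    Maintains a stack of (block mean, block size) pairs and merges adjacent
    blocks backwards whenever monotonicity is violated."""
    out = []
    for v in y:
        out.append([v, 1])
        while len(out) > 1 and out[-2][0] > out[-1][0]:
            m2, s2 = out.pop()
            m1, s1 = out.pop()
            out.append([(m1 * s1 + m2 * s2) / (s1 + s2), s1 + s2])
    fit = []
    for mean, size in out:
        fit.extend([mean] * size)
    return fit
```

For example, `pava([3, 1, 2])` pools all three values into their common mean.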
- Motivated by applications in biological science, we propose a novel test to assess the conditional mean dependence of a response variable on a large number of covariates. Our procedure is built on the martingale difference divergence recently proposed in Shao and Zhang (2014), and it is able to detect a certain type of departure from the null hypothesis of conditional mean independence without making any specific model assumptions. Theoretically, we establish the asymptotic normality of the proposed test statistic under suitable assumptions on the eigenvalues of a Hermitian operator, which is constructed based on the characteristic function of the covariates. These conditions can be simplified under banded dependence structure on the covariates or Gaussian design. To account for heterogeneity within the data, we further develop a testing procedure for conditional quantile independence at a given quantile level and provide an asymptotic justification. Empirically, our test of conditional mean independence delivers comparable results to the competitor, which was constructed under the linear model framework, when the underlying model is linear. It significantly outperforms the competitor when the conditional mean admits a nonlinear form.
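One common sample form of the martingale difference divergence of Shao and Zhang (2014), for a univariate response, can be sketched directly; the studentization and calibration used for the actual test are omitted.

```python
import numpy as np

def mdd_sq(X, Y):
    """Squared sample martingale difference divergence (V-statistic form):
    MDD_n^2 = -(1/n^2) * sum_{i,j} (Y_i - Ybar)(Y_j - Ybar) ||X_i - X_j||.
    Its population analogue is zero iff E[Y|X] does not depend on X."""
    X = np.asarray(X, float).reshape(len(X), -1)       # n x p
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    yc = Y - Y.mean()
    return -np.mean(np.outer(yc, yc) * d)
```

Because the Euclidean distance is conditionally negative definite and `yc` sums to zero, the sample statistic is nonnegative; a constant response gives exactly zero.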
- Jan 24 2017 stat.ME arXiv:1701.06263v1: In functional data analysis (FDA), the covariance function is fundamental, not only as a critical quantity for understanding elementary aspects of functional data but also as an indispensable ingredient for many advanced FDA methods. This paper develops a new class of nonparametric covariance function estimators in terms of various spectral regularizations of an operator associated with a reproducing kernel Hilbert space. Despite their nonparametric nature, the covariance estimators are automatically positive semi-definite without any additional modification steps. An unconventional representer theorem is established to provide a finite-dimensional representation for this class of covariance estimators, which leads to a closed-form expression of the corresponding $L^2$ eigen-decomposition. Trace-norm regularization is studied in particular to further achieve a low-rank representation, another desirable property which leads to dimension reduction and is often needed in advanced FDA approaches. An efficient algorithm is developed based on the accelerated proximal gradient method. The resulting estimator is shown to enjoy an excellent rate of convergence under both fixed and random designs. The outstanding practical performance of the trace-norm-regularized covariance estimator is demonstrated by a simulation study and the analysis of a traffic dataset.
- Jan 18 2017 stat.AP arXiv:1701.04423v1: We extend the model used in Gardiner et al. (2002) and Polverejan et al. (2003) by deriving an explicit expression for the joint probability density function of hospital charge and length of stay (LOS) under a general class of conditions. Using this joint density function, we can apply the full maximum likelihood method (FML) to estimate the effect of covariates on charge and LOS. With FML, the endogeneity issues arising from the dependence between charge and LOS can be efficiently resolved. As an illustrative example, we apply our method to real charge and LOS data sampled from the New York State Statewide Planning and Research Cooperative System 2013 (SPARCS 2013). We compare our result with the fit to the marginal LOS data obtained by the widely used phase-type model, and conclude that our method provides a more efficient fit.
- Jan 17 2017 stat.ML arXiv:1701.04207v1: Canonical correlation analysis (CCA) is a multivariate statistical technique for finding the linear relationship between two sets of variables. The kernel generalization of CCA, named kernel CCA, has been proposed to find nonlinear relations between datasets. Despite their wide usage, both share a common limitation: the lack of sparsity in their solutions. In this paper, we consider sparse kernel CCA and propose a novel sparse kernel CCA algorithm (SKCCA). Our algorithm is based on a relationship between kernel CCA and least squares. Sparsity of the dual transformations is introduced by penalizing the $\ell_{1}$-norm of the dual vectors. Experiments demonstrate that our algorithm not only performs well in computing sparse dual transformations but also alleviates the over-fitting problem of kernel CCA.
- This paper studies the problem of multivariate linear regression where a portion of the observations is grossly corrupted or missing, and the magnitudes and locations of such occurrences are unknown a priori. To deal with this problem, we propose a new approach that explicitly considers the error source as well as its sparse nature. An interesting property of our approach lies in its ability to allow individual regression output elements or tasks to possess their unique noise levels. Moreover, despite working with a non-smooth optimization problem, our approach is guaranteed to converge to its optimal solution. Experiments on synthetic data demonstrate the competitiveness of our approach compared with existing multivariate regression models. In addition, our approach has been empirically validated with very promising results on two exemplar real-world applications: the first concerns the prediction of Big-Five personality based on user behaviors at social network sites (SNSs), while the second is 3D human hand pose estimation from depth images. The implementation of our approach and the comparison methods, as well as the involved datasets, are made publicly available in support of the open-source and reproducible research initiatives.
- Jan 10 2017 stat.ML arXiv:1701.02301v2: We propose a generic framework based on a new stochastic variance-reduced gradient descent algorithm for accelerating nonconvex low-rank matrix recovery. Starting from an appropriate initial estimator, our proposed algorithm performs projected gradient descent based on a novel semi-stochastic gradient specifically designed for low-rank matrix recovery. Under mild restricted strong convexity and smoothness conditions, we derive a projected notion of the restricted Lipschitz continuous gradient property, and prove that our algorithm enjoys a linear convergence rate to the unknown low-rank matrix with improved computational complexity. Moreover, our algorithm can be applied to both noiseless and noisy observations, attaining the optimal sample complexity and the minimax optimal statistical rate, respectively. We further illustrate the superiority of our generic framework through several specific examples, both theoretically and experimentally.
- Jan 03 2017 stat.ML arXiv:1701.00481v2: We study the problem of estimating low-rank matrices from linear measurements (a.k.a. matrix sensing) through nonconvex optimization. We propose an efficient stochastic variance reduced gradient descent algorithm to solve the nonconvex optimization problem of matrix sensing. Our algorithm is applicable to both noisy and noiseless settings. In the case with noisy observations, we prove that our algorithm converges to the unknown low-rank matrix at a linear rate up to the minimax optimal statistical error. In the noiseless setting, our algorithm is guaranteed to converge linearly to the unknown low-rank matrix and achieves exact recovery with optimal sample complexity. Most notably, the overall computational complexity of our proposed algorithm, defined as the iteration complexity times the per-iteration time complexity, is lower than that of state-of-the-art algorithms based on gradient descent. Experiments on synthetic data corroborate the superiority of the proposed algorithm over the state-of-the-art algorithms.
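The variance-reduction mechanism at the heart of such algorithms can be illustrated on plain least squares (the low-rank projection machinery is deliberately omitted, so this is not the paper's matrix-sensing algorithm):

```python
import numpy as np

def svrg(A, b, x0, step=0.02, epochs=20, rng=None):
    """SVRG sketch for f(x) = (1/2m)||Ax - b||^2: each epoch computes the
    full gradient at a snapshot, then takes stochastic steps corrected by
    the snapshot gradient, so the step variance vanishes at the optimum."""
    if rng is None:
        rng = np.random.default_rng(0)
    m, n = A.shape
    x = x0.copy()
    for _ in range(epochs):
        snap = x.copy()
        full = A.T @ (A @ snap - b) / m          # full gradient at snapshot
        for _ in range(m):
            i = rng.integers(m)
            gi = A[i] * (A[i] @ x - b[i])        # stochastic gradient at x
            gs = A[i] * (A[i] @ snap - b[i])     # same sample at snapshot
            x -= step * (gi - gs + full)         # variance-reduced update
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 3))
x_true = np.array([1.0, -2.0, 0.5])
x_hat = svrg(A, A @ x_true, np.zeros(3))
```

Because `gi - gs + full` is an unbiased gradient estimate whose variance shrinks as `x` approaches the snapshot, a constant step size suffices for linear convergence here.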
- Oct 18 2016 stat.ML arXiv:1610.05275v1: We propose a unified framework for estimating low-rank matrices through nonconvex optimization based on a gradient descent algorithm. Our framework is quite general and can be applied to both noisy and noiseless observations. In the general case with noisy observations, we show that our algorithm is guaranteed to converge linearly to the unknown low-rank matrix up to the minimax optimal statistical error, provided an appropriate initial estimator. In the noiseless setting, our algorithm converges to the unknown low-rank matrix at a linear rate and enables exact recovery with optimal sample complexity. In addition, we develop a new initialization algorithm to provide the desired initial estimator, which outperforms existing initialization algorithms for nonconvex low-rank matrix estimation. We illustrate the superiority of our framework through three examples: matrix regression, matrix completion, and one-bit matrix completion. We also corroborate our theory through extensive experiments on synthetic data.
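A bare-bones sketch of nonconvex low-rank estimation by gradient descent on the factors, here for the matrix completion example, with a standard spectral initialization from the zero-filled observed matrix; the step size and iteration count are illustrative assumptions, not the paper's choices.

```python
import numpy as np

def factored_gd(M_obs, mask, r, step=0.01, iters=500):
    """Gradient descent on low-rank factors (U, V) for matrix completion.
    M_obs holds observed entries (zeros elsewhere); mask is 0/1."""
    p = mask.mean()                                # observation fraction
    U0, s, Vt = np.linalg.svd(M_obs / p, full_matrices=False)
    U = U0[:, :r] * np.sqrt(s[:r])                 # balanced spectral init
    V = Vt[:r].T * np.sqrt(s[:r])
    for _ in range(iters):
        R = mask * (U @ V.T - M_obs)               # residual on observed entries
        U, V = U - step * R @ V, V - step * R.T @ U
    return U @ V.T

rng = np.random.default_rng(0)
L = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 20))
mask = (rng.random((20, 20)) < 0.6).astype(float)  # observe ~60% of entries
L_hat = factored_gd(mask * L, mask, r=2)
```

Parameterizing the unknown as `U @ V.T` keeps every iterate rank-`r` without any projection step, which is the computational appeal of the factored approach.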
- Mobile big data contains vast statistical features in various dimensions, including the spatial, temporal, and underlying social domains. Understanding and exploiting the features of mobile data from a social network perspective will be extremely beneficial to wireless networks, from planning, operation, and maintenance to optimization and marketing. In this paper, we categorize and analyze the big data collected from real wireless cellular networks. We then study the social characteristics of mobile big data and highlight several research directions for mobile big data in the social computing areas.
- Sep 30 2016 stat.ME arXiv:1609.09380v1: In this paper, we introduce an $\mathcal{L}_2$-type test for testing mutual independence and banded dependence structure for high dimensional data. The test is constructed based on the pairwise distance covariance, and it accounts for the non-linear and non-monotone dependences among the data, which cannot be fully captured by existing tests based on either Pearson correlation or rank correlation. Our test can be conveniently implemented in practice, as the limiting null distribution of the test statistic is shown to be standard normal. It exhibits excellent finite sample performance in our simulation studies even when the sample size is small and the dimension is high, and is shown to successfully identify nonlinear dependence in empirical data analysis. On the theory side, asymptotic normality of our test statistic is established under quite mild moment assumptions and with little restriction on the convergence rate of the dimension as a function of sample size. As a demonstration of the good power properties of our distance covariance based test, we further show that an infeasible version of the test statistic is rate-optimal in the class of Gaussian distributions with equal correlation.
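The pairwise distance covariance underlying the test can be sketched in its squared V-statistic form (univariate version; the studentization and limiting-distribution details of the actual test are omitted):

```python
import numpy as np

def dcov_sq(x, y):
    """Squared sample distance covariance (V-statistic): double-center the
    pairwise distance matrices of x and y, then average their elementwise
    product. Its population analogue is zero iff x and y are independent."""
    def centered(v):
        d = np.abs(v[:, None] - v[None, :])   # pairwise distances (univariate)
        return d - d.mean(0) - d.mean(1)[:, None] + d.mean()
    return (centered(x) * centered(y)).mean()
```

Unlike Pearson or rank correlation, this quantity also responds to non-monotone dependence, which is what lets the test above capture nonlinear structure.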
- The latent Dirichlet allocation (LDA) model is a widely-used latent variable model in machine learning for text analysis. Inference for this model typically involves a single-site collapsed Gibbs sampling step for latent variables associated with observations. The efficiency of the sampling is critical to the success of the model in practical large scale applications. In this article, we introduce a blocking scheme to the collapsed Gibbs sampler for the LDA model which can, with a theoretical guarantee, improve chain mixing efficiency. We develop two procedures, an O(K)-step backward simulation and an O(log K)-step nested simulation, to directly sample the latent variables within each block. We demonstrate that the blocking scheme achieves substantial improvements in chain mixing compared to the state of the art single-site collapsed Gibbs sampler. We also show that when the number of topics is in the hundreds, the nested-simulation blocking scheme can achieve a significant reduction in computation time compared to the single-site sampler.
- There is a growing need for the ability to analyse interval-valued data. However, existing descriptive frameworks to achieve this ignore the process by which interval-valued data are typically constructed; namely by the aggregation of real-valued data generated from some underlying process. In this article we develop the foundations of likelihood based statistical inference for random intervals that directly incorporates the underlying generative procedure into the analysis. That is, it permits the direct fitting of models for the underlying real-valued data given only the random interval-valued summaries. This generative approach overcomes several problems associated with existing methods, including the rarely satisfied assumption of within-interval uniformity. The new methods are illustrated by simulated and real data analyses.
- The causal discovery of Bayesian networks is an active and important research area, based upon searching the space of causal models for those which can best explain a pattern of probabilistic dependencies shown in the data. However, some of those dependencies are generated by causal structures involving variables which have not been measured, i.e., latent variables. Some such patterns of dependency "reveal" themselves, in that no model based solely upon the observed variables can explain them as well as a model using a latent variable. That is what latent variable discovery is based upon. Here we search for such patterns systematically, so that they may be applied in latent variable discovery in a more rigorous fashion.
- Multinomial logistic regression is a popular tool in the arsenal of machine learning algorithms, yet scaling it to datasets with a very large number of data points and classes has not been trivial. This is primarily because one needs to compute the log-partition function on every data point, which makes distributing the computation hard. In this paper, we present a distributed stochastic gradient descent based optimization method (DS-MLR) for scaling up multinomial logistic regression problems to very large data. Our algorithm exploits double-separability, an attractive property we observe in the objective functions of several models in machine learning, which allows us to achieve both data and model parallelism simultaneously. In addition to being parallelizable, our algorithm can also easily be made asynchronous. To demonstrate the effectiveness of our method, we solve a very large multi-class classification problem on the reddit dataset, with data and parameter sizes of 200 GB and 300 GB respectively. Such a scale of data calls for simultaneous data and model parallelism, which is where DS-MLR fits in.
- The r largest order statistics approach is widely used in extreme value analysis because it may use more information from the data than just the block maxima. In practice, the choice of r is critical. If r is too large, bias can occur; if too small, the variance of the estimator can be high. The limiting distribution of the r largest order statistics, denoted by GEVr, extends that of the block maxima. Two specification tests are proposed to select r sequentially. The first is a score test for the GEVr distribution. Due to the special characteristics of the GEVr distribution, the classical chi-square asymptotics cannot be used. The simplest approach is to use the parametric bootstrap, which is straightforward to implement but computationally expensive. An alternative fast weighted bootstrap or multiplier procedure is developed for computational efficiency. The second test uses the difference in estimated entropy between the GEVr and GEV(r-1) models, applied to the r largest order statistics and the r-1 largest order statistics, respectively. The asymptotic distribution of the difference statistic is derived. In a large scale simulation study, both tests held their size and had substantial power to detect various misspecification schemes. A new approach to address the issue of multiple, sequential hypotheses testing is adapted to this setting to control the false discovery rate or familywise error rate. The utility of the procedures is demonstrated with extreme sea level and precipitation data.
- Threshold selection is a critical issue for extreme value analysis with threshold-based approaches. Under suitable conditions, exceedances over a high threshold have been shown to follow the generalized Pareto distribution (GPD) asymptotically. In practice, however, the threshold must be chosen. If the chosen threshold is too low, the GPD approximation may not hold and bias can occur. If the threshold is chosen too high, the reduced sample size increases the variance of parameter estimates. For batch analyses, commonly used selection methods such as graphical diagnostics are subjective and cannot be automated, while computational methods may not be feasible. We propose to test a set of thresholds through the goodness-of-fit of the GPD for the exceedances, and to select the lowest one above which the data provide an adequate fit to the GPD. Previous attempts in this setting are not valid due to the special feature that the multiple tests are done in an ordered fashion. We apply two recently available stopping rules that control the false discovery rate or familywise error rate to ordered goodness-of-fit tests to automate threshold selection. Various model specification tests, such as the Cramér-von Mises, Anderson-Darling, Moran's, and a score test, are investigated. The performance of the method is assessed in a large scale simulation study that mimics practical return level estimation. The procedure was repeated at hundreds of sites in the western US to generate return level maps of extreme precipitation.
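The threshold-scanning idea can be caricatured as follows: fit the GPD to the exceedances over each candidate threshold and take the lowest one that passes a goodness-of-fit check. This sketch substitutes a simple method-of-moments GPD fit and a crude Kolmogorov-Smirnov cutoff for the ordered stopping rules and the specific tests named above, so it is illustrative only.

```python
import numpy as np

def gpd_fit_moments(z):
    """Method-of-moments GPD fit: xi and sigma from the exceedance mean/variance."""
    m, v = z.mean(), z.var()
    xi = 0.5 * (1.0 - m * m / v)
    sigma = 0.5 * m * (m * m / v + 1.0)
    return xi, sigma

def gpd_cdf(z, xi, sigma):
    if abs(xi) < 1e-8:                               # xi -> 0: exponential limit
        return 1.0 - np.exp(-z / sigma)
    return 1.0 - np.maximum(1.0 + xi * z / sigma, 0.0) ** (-1.0 / xi)

def select_threshold(x, thresholds, ks_crit=0.08):
    """Return the lowest threshold whose exceedances pass a crude KS check."""
    for u in sorted(thresholds):
        z = np.sort(x[x > u] - u)
        xi, sigma = gpd_fit_moments(z)
        ks = np.max(np.abs(gpd_cdf(z, xi, sigma)
                           - np.arange(1, len(z) + 1) / len(z)))
        if ks < ks_crit:
            return u
    return None
```

Exponential data are GPD with shape zero above any threshold, so the scan should accept the lowest candidate.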
- Multiple-input multiple-output (MIMO) radar has become a thriving subject of research during the past decades. In the MIMO radar context, it is sometimes more accurate to model the radar clutter as a non-Gaussian process, more specifically, by using the spherically invariant random process (SIRP) model. In this paper, we focus on the estimation and performance analysis of the angular spacing between two targets for MIMO radar under SIRP clutter. First, we propose an iterative maximum likelihood as well as an iterative maximum a posteriori estimator for the target's spacing parameter in the SIRP clutter context. Then we derive and compare various Cramér-Rao-like bounds (CRLBs) for performance assessment. Next, we address the problem of target resolvability by using the concept of angular resolution limit (ARL), and derive an analytical, closed-form expression of the ARL based on Smith's criterion, between two closely spaced targets in a MIMO radar context under SIRP clutter. To this end, we also obtain the non-matrix, closed-form expressions for each of the CRLBs. Finally, we provide numerical simulations to assess the performance of the proposed algorithms and the validity of the derived ARL expression, and to reveal the ARL's insightful properties.
- The maximum likelihood (ML) and maximum a posteriori (MAP) estimation techniques are widely used to address the direction-of-arrival (DOA) estimation problems, an important topic in sensor array processing. Conventionally the ML estimators in the DOA estimation context assume the sensor noise to follow a Gaussian distribution. In real-life applications, however, this assumption is sometimes not valid, and it is often more accurate to model the noise as a non-Gaussian process. In this paper we derive an iterative ML as well as an iterative MAP estimation algorithm for the DOA estimation problem under the spherically invariant random process noise assumption, one of the most popular non-Gaussian models, especially in the radar context. Numerical simulation results are provided to assess our proposed algorithms and to show their advantage in terms of performance over the conventional ML algorithm.
- This paper proposes a bootstrap-assisted procedure to conduct simultaneous inference for high dimensional sparse linear models based on the recent de-sparsifying Lasso estimator (van de Geer et al. 2014). Our procedure allows the dimension of the parameter vector of interest to be exponentially larger than sample size, and it automatically accounts for the dependence within the de-sparsifying Lasso estimator. Moreover, our simultaneous testing method can be naturally coupled with the margin screening (Fan and Lv 2008) to enhance its power in sparse testing with a reduced computational cost, or with the step-down method (Romano and Wolf 2005) to provide a strong control for the family-wise error rate. In theory, we prove that our simultaneous testing procedure asymptotically achieves the pre-specified significance level, and enjoys certain optimality in terms of its power even when the model errors are non-Gaussian. Our general theory is also useful in studying the support recovery problem. To broaden the applicability, we further extend our main results to generalized linear models with convex loss functions. The effectiveness of our methods is demonstrated via simulation studies.
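The simultaneous-inference machinery above rests on approximating the distribution of a maximum-type statistic. A generic multiplier-bootstrap sketch for the maximum of standardized sums follows; the de-sparsified lasso scores themselves are not constructed here, and the `scores` input is a hypothetical stand-in for them.

```python
import numpy as np

def max_stat_critval(scores, alpha=0.05, B=1000, rng=None):
    """Multiplier-bootstrap critical value for max_j |n^{-1/2} sum_i e_i S_ij| / sd_j.
    Drawing Gaussian multipliers e preserves the cross-coordinate dependence
    of the score matrix, which a coordinate-wise normal cutoff would ignore."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, p = scores.shape
    sd = scores.std(0)
    stats = np.empty(B)
    for b in range(B):
        e = rng.standard_normal(n)                      # multipliers
        stats[b] = np.max(np.abs(e @ scores) / (np.sqrt(n) * sd))
    return np.quantile(stats, 1 - alpha)
```

For roughly independent coordinates the critical value should land near the Gaussian max-of-p level, about `sqrt(2 log p)` and above.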
- In this paper, we provide a general method to obtain exact solutions of the degree distributions for RBDN with declining network size. First, by stochastic process rules, the steady state transformation equations and steady state degree distribution equations are given in the case of m>2 and 0<p<1/2; the average degree of a network with n nodes is then introduced to calculate the degree distribution. Taking m=3 as an example, we explain the detailed solving process, with computer simulation used to verify our degree distribution solutions. In addition, the tail characteristics of the degree distribution are discussed. Our findings suggest that the degree distributions will exhibit a Poisson tail property for the declining RBDN.
- Sep 29 2015 stat.ME arXiv:1509.08444v2: Motivated by the likelihood ratio test under the Gaussian assumption, we develop a maximum sum-of-squares test for conducting hypothesis testing on a high-dimensional mean vector. The proposed test, which incorporates the dependence among the variables, is designed to ease the computational burden and to maximize the asymptotic power of the likelihood ratio test. A simulation-based approach is developed to approximate the sampling distribution of the test statistic. The validity of the testing procedure is justified under both the null and alternative hypotheses. We further extend the main results to the two sample problem without the equal covariance assumption. Numerical results suggest that the proposed test can be more powerful than some existing alternatives.
- Motivated by the need to statistically quantify the difference between two spatio-temporal datasets that arise in climate downscaling studies, we propose new tests to detect the differences of the covariance operators and their associated characteristics of two functional time series. Our two sample tests are constructed on the basis of functional principal component analysis and self-normalization, the latter of which is a new studentization technique recently developed for the inference of a univariate time series. Compared to the existing tests, our SN-based tests allow for weak dependence within each sample and are robust to the dependence between the two samples in the case of equal sample sizes. Asymptotic properties of the SN-based test statistics are derived under both the null and local alternatives. Through extensive simulations, our SN-based tests are shown to outperform existing alternatives in size, and their powers are found to be respectable. The tests are then applied to the gridded climate model outputs and interpolated observations to detect the difference in their spatial dynamics.
- Recently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations on standard data sets, e.g., MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping from the input of the bootstrap model to the output of the application with a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN, which takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST shows that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. This suggests that the new technique combines the effectiveness of MBN for unsupervised learning with the effectiveness and efficiency of DNNs for supervised learning.
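The compression recipe — query the expensive model on the training inputs, then fit a cheap supervised student to the input-output map — can be sketched generically. Here polynomial least squares stands in for the DNN student, and `black_box` is a hypothetical stand-in for MBN plus its downstream application.

```python
import numpy as np

def compress(black_box, X, degree=3):
    """Distill an expensive black-box model into a cheap student by fitting
    polynomial least squares to the black box's outputs on X (illustrative;
    compressive MBN uses a DNN student instead)."""
    def featurize(Z):
        return np.hstack([Z ** d for d in range(degree + 1)])
    Y = black_box(X)                       # query the black box once, offline
    W, *_ = np.linalg.lstsq(featurize(X), Y, rcond=None)
    return lambda Z: featurize(Z) @ W      # fast student for prediction time

# toy demo: the "expensive" model is just an elementwise square
black_box = lambda X: X ** 2
X = np.linspace(-1, 1, 50).reshape(-1, 1)
student = compress(black_box, X, degree=3)
```

Prediction then costs one small matrix multiply instead of a full bootstrap-model pass, which is the source of the speedup claimed above.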
- There has been an increasing interest in testing the equality of large Pearson's correlation matrices. However, in many applications it is more important to test the equality of large rank-based correlation matrices, since they are more robust to outliers and nonlinearity. Unlike the Pearson case, testing the equality of large rank-based correlation matrices has not been well explored and requires us to develop new methods and theory. In this paper, we provide a framework for testing the equality of two large U-statistic based correlation matrices, which include the rank-based correlation matrices as special cases. Our approach exploits extreme value statistics and the jackknife estimator for uncertainty assessment and is valid under a fully nonparametric model. Theoretically, we develop a theory for testing the equality of U-statistic based correlation matrices. We then apply this theory to study the problem of testing large Kendall's tau correlation matrices and demonstrate its optimality. For proving this optimality, a novel construction of least favourable distributions is developed for the correlation matrix comparison.
- Feb 02 2015 stat.ME arXiv:1501.07815v1: Motivated by the abundance of scientific and engineering data with both high dimensionality and complex structure, we study the regression problem with a multidimensional array (tensor) response and a vector predictor. Applications include, among others, comparing tensor images across groups after adjusting for additional covariates, which is of central interest in neuroimaging analysis. We propose parsimonious tensor response regression adopting a generalized sparsity principle. It models all voxels of the tensor response jointly, while accounting for the inherent structural information among the voxels. It effectively reduces the number of free parameters, leading to feasible computation and improved interpretation. We achieve model estimation through a nascent technique called the envelope method, which identifies the immaterial information and bases the estimation on the material information in the tensor response. We demonstrate that the resulting estimator is asymptotically efficient and enjoys competitive finite sample performance. We also illustrate the new method on two real neuroimaging studies.
- Dec 23 2014 stat.ME arXiv:1412.6592v1: In an increasing number of neuroimaging studies, brain images, which are in the form of multidimensional arrays (tensors), have been collected on multiple subjects at multiple time points. Of scientific interest is to analyze such massive and complex longitudinal images to diagnose neurodegenerative disorders and to identify disease relevant brain regions. In this article, we treat those problems in a unifying regression framework with image predictors, and propose tensor generalized estimating equations (GEE) for longitudinal imaging analysis. The GEE approach takes into account intra-subject correlation of responses, whereas a low rank tensor decomposition of the coefficient array enables effective estimation and prediction with limited sample size. We propose an efficient estimation algorithm, study the asymptotics in both fixed $p$ and diverging $p$ regimes, and also investigate tensor GEE with regularization that is particularly useful for region selection. The efficacy of the proposed tensor GEE is demonstrated on both simulated data and a real data set from the Alzheimer's Disease Neuroimaging Initiative (ADNI).
- In (\cite{zhang2014nonlinear,zhang2014nonlinear2}), we viewed machine learning as a coding and dimensionality reduction problem, and proposed a simple unsupervised dimensionality reduction method, entitled deep distributed random samplings (DDRS). In this paper, we extend it incrementally to supervised learning. The key idea is to incorporate label information into the coding process by reformulating DDRS so that each center has multiple output units indicating which class the center belongs to. The supervised method may seem similar to random forests (\cite{breiman2001random}); we emphasize their differences as follows. (i) Each layer of our method considers the relationship between part of the training data points and all training data points, while random forests build each decision tree on only part of the training data points independently. (ii) Our method builds a gradually narrowed network by sampling fewer and fewer data points, while random forests build a gradually narrowed network by merging subclasses. (iii) Our method is trained straightforwardly from the bottom layer to the top layer, while random forests build each tree from the top layer to the bottom layer by splitting. (iv) Our method encodes output targets implicitly in sparse codes, while random forests encode output targets by remembering the class attributes of the activated nodes. Our method is therefore a simpler, more straightforward, and possibly better alternative, though both methods use two basic elements---randomization and nearest neighbor optimization---as their core. This preprint is intended to protect the incremental idea in (\cite{zhang2014nonlinear,zhang2014nonlinear2}). A full empirical evaluation will be announced later.
- We describe the neural-network training framework used in the Kaldi speech recognition toolkit, which is geared towards training DNNs with large amounts of training data using multiple GPU-equipped or multi-core machines. In order to be as hardware-agnostic as possible, we needed a way to use multiple machines without generating excessive network traffic. Our method is to average the neural network parameters periodically (typically every minute or two), and redistribute the averaged parameters to the machines for further training. Each machine sees different data. By itself, this method does not work very well. However, we have another method, an approximate and efficient implementation of Natural Gradient for Stochastic Gradient Descent (NG-SGD), which seems to allow our periodic-averaging method to work well, as well as substantially improving the convergence of SGD on a single machine.
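The periodic parameter-averaging scheme described above can be sketched as follows. This is a simplified toy illustration, not the Kaldi implementation: a least-squares objective stands in for the DNN loss, two lists of samples stand in for the data shards on different machines, and all function names are hypothetical.

```python
import numpy as np

def local_sgd(theta, shard, lr=0.05):
    """One pass of plain SGD on a least-squares objective
    (a stand-in for the DNN training each machine performs)."""
    theta = theta.copy()
    for x, y in shard:
        grad = 2.0 * (theta @ x - y) * x  # gradient of (theta.x - y)^2
        theta -= lr * grad
    return theta

def averaged_training(shards, dim, rounds=20):
    theta = np.zeros(dim)
    for _ in range(rounds):
        # Each "machine" starts from the current averaged parameters
        # and trains on its own data shard.
        local_params = [local_sgd(theta, shard) for shard in shards]
        # Periodic averaging step: average and redistribute.
        theta = np.mean(local_params, axis=0)
    return theta

rng = np.random.default_rng(0)
true_theta = np.array([1.0, -2.0])
data = [(x, x @ true_theta) for x in rng.normal(size=(40, 2))]
shards = [data[:20], data[20:]]  # two "machines", different data
theta = averaged_training(shards, dim=2)
print(np.round(theta, 2))
```

On this noiseless toy problem the averaged parameters converge to the true solution; the abstract's point is that for non-convex DNN training this plain averaging works well only in combination with NG-SGD.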
- Structured sparsity is an important modeling tool that expands the applicability of convex formulations for data analysis; however, it also creates significant challenges for efficient algorithm design. In this paper we investigate the generalized conditional gradient (GCG) algorithm for solving structured sparse optimization problems---demonstrating that, with some enhancements, it can provide a more efficient alternative to current state-of-the-art approaches. After providing a comprehensive overview of the convergence properties of GCG, we develop efficient methods for evaluating polar operators, a subroutine required in each GCG iteration. In particular, we show how the polar operator can be efficiently evaluated in two important scenarios: dictionary learning and structured sparse estimation. A further improvement is achieved by interleaving GCG with fixed-rank local subspace optimization. A series of experiments on matrix completion, multi-class classification, multi-view dictionary learning and overlapping group lasso shows that the proposed method can significantly reduce the training cost of current alternatives.
- Multilayer bootstrap network builds a gradually narrowed multilayer nonlinear network from bottom up for unsupervised nonlinear dimensionality reduction. Each layer of the network is a group of k-centers clusterings. Each clustering uses randomly sampled data points with randomly selected features as its centers, and learns a one-of-k encoding by one-nearest-neighbor optimization. Thanks to the binarized encoding, the similarity of two data points is measured by the number of the nearest centers they share in common, which is an adaptive similarity metric in the discrete space that needs no model assumption and parameter tuning. Thanks to the network structure, larger and larger local variations of data are gradually reduced from bottom up. The information loss caused by the binarized encoding is proportional to the correlation of the clusterings, both of which are reduced by the randomization steps.
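A single layer of the multilayer bootstrap network described above can be sketched as a group of k-centers clusterings with one-of-k encoding. This is an illustrative simplification (it uses Euclidean distance and omits the random feature selection mentioned in the abstract); all names are assumptions.

```python
import numpy as np

def mbn_layer(X, n_clusterings=5, k=3, rng=None):
    """One layer: a group of k-centers clusterings, each producing a
    one-of-k (one-hot) code via one-nearest-neighbor assignment."""
    if rng is None:
        rng = np.random.default_rng(0)
    codes = []
    for _ in range(n_clusterings):
        # Randomly sample k data points as this clustering's centers.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        # One-nearest-neighbor assignment -> one-of-k encoding.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        onehot = np.eye(k)[d.argmin(axis=1)]
        codes.append(onehot)
    # Binary code with n_clusterings * k columns; the similarity of two
    # points is the number of nearest centers they share.
    return np.hstack(codes)

X = np.random.default_rng(1).normal(size=(10, 4))
H = mbn_layer(X)
print(H.shape)  # (10, 15)
```

Stacking such layers with fewer and fewer sampled centers yields the gradually narrowed network of the abstract; each row of the output has exactly one active unit per clustering.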
- Aug 04 2014 stat.AP arXiv:1408.0095v1 We develop a novel peak detection algorithm for the analysis of comprehensive two-dimensional gas chromatography time-of-flight mass spectrometry (GC$\times$GC-TOF MS) data using normal-exponential-Bernoulli (NEB) and mixture probability models. The algorithm first performs baseline correction and denoising simultaneously using the NEB model, which also defines peak regions. Peaks are then picked using a mixture of probability distributions to deal with co-eluting peaks. Peak merging is further carried out based on the mass spectral similarities among the peaks within the same peak group. The algorithm is evaluated using experimental data to study the effect of different cutoffs of the conditional Bayes factors and the effect of different mixture models, including Poisson, truncated Gaussian, Gaussian, Gamma and exponentially modified Gaussian (EMG) distributions, and the optimal version is selected using a trial-and-error approach. We then compare the new algorithm with two existing algorithms in terms of compound identification. Data analysis shows that the developed algorithm can detect peaks with lower false discovery rates than the existing algorithms, and that a less complicated peak picking model is a promising alternative to the more complicated and widely used EMG mixture models.
- Many machine learning algorithms minimize a regularized risk, and stochastic optimization is widely used for this task. When working with massive data, it is desirable to perform stochastic optimization in parallel. Unfortunately, many existing stochastic optimization algorithms cannot be parallelized efficiently. In this paper we show that one can rewrite the regularized risk minimization problem as an equivalent saddle-point problem, and propose an efficient distributed stochastic optimization (DSO) algorithm. We prove the algorithm's rate of convergence; remarkably, our analysis shows that the algorithm scales almost linearly with the number of processors. We also verify with empirical evaluations that the proposed algorithm is competitive with other parallel, general purpose stochastic and batch optimization algorithms for regularized risk minimization.
- This article studies bootstrap inference for high dimensional weakly dependent time series in a general framework of approximately linear statistics. The following high dimensional applications are covered: (1) uniform confidence band for mean vector; (2) specification testing on the second order property of time series such as white noise testing and bandedness testing of covariance matrix; (3) specification testing on the spectral property of time series. In theory, we first derive a Gaussian approximation result for the maximum of a sum of weakly dependent vectors, where the dimension of the vectors is allowed to be exponentially larger than the sample size. In particular, we illustrate an interesting interplay between dependence and dimensionality, and also discuss one type of "dimension free" dependence structure. We further propose a blockwise multiplier (wild) bootstrap that works for time series with unknown autocovariance structure. The distributional approximation errors, which are valid in finite samples, decrease polynomially in the sample size. A non-overlapping block bootstrap is also studied as a more flexible alternative. The above results are established under the general physical/functional dependence framework proposed in Wu (2005). Our work can be viewed as a substantive extension of Chernozhukov et al. (2013) to time series based on a variant of Stein's method developed therein.
- Mar 18 2014 stat.ME arXiv:1403.4138v1 Envelopes were recently proposed as methods for reducing estimative variation in multivariate linear regression. Estimation of an envelope usually involves optimization over Grassmann manifolds. We propose a fast and widely applicable one-dimensional (1D) algorithm for estimating an envelope in general. We reveal an important structural property of envelopes that facilitates our algorithm, and we prove both the Fisher consistency and the root-n consistency of the algorithm.
- In this paper we consider the problem of graph-based transductive classification, and we are particularly interested in the directed graph scenario, which is a natural form for many real-world applications. Different from existing research efforts that either deal only with undirected graphs or circumvent directionality by means of symmetrization, we propose a novel random walk approach on directed graphs using absorbing Markov chains, which can be regarded as maximizing the accumulated expected number of visits from the unlabeled transient states. Our algorithm is simple, easy to implement, and works with large-scale graphs. In particular, it is capable of preserving the graph structure even when the input graph is sparse and changes over time, as well as retaining weak signals present in the directed edges. We present its intimate connections to a number of existing methods, including graph kernels, graph Laplacian based methods, and, interestingly, spanning forests of graphs. Its computational complexity and generalization error are also studied. Empirically, our algorithm is systematically evaluated on a wide range of applications, where it is shown to perform competitively compared to a suite of state-of-the-art methods.
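The absorbing-Markov-chain machinery underlying the approach above can be sketched with the standard fundamental-matrix computation: with labeled nodes made absorbing and unlabeled nodes transient, $N = (I - Q)^{-1}$ gives expected visit counts among transient states, and $B = NR$ gives absorption probabilities into each labeled class. The tiny four-node graph below is an illustrative assumption, not from the paper.

```python
import numpy as np

# Transition matrix ordered as [transient t0, t1 | absorbing a0, a1],
# where the absorbing states play the role of labeled nodes.
P = np.array([
    [0.0, 0.5, 0.5, 0.0],   # t0 (unlabeled)
    [0.4, 0.0, 0.0, 0.6],   # t1 (unlabeled)
    [0.0, 0.0, 1.0, 0.0],   # a0 (labeled, class 0)
    [0.0, 0.0, 0.0, 1.0],   # a1 (labeled, class 1)
])
Q = P[:2, :2]   # transient -> transient block
R = P[:2, 2:]   # transient -> absorbing block

N = np.linalg.inv(np.eye(2) - Q)  # fundamental matrix: expected visits
B = N @ R                          # absorption probabilities per class
print(np.round(B, 3))  # [[0.625 0.375]
                       #  [0.25  0.75 ]]
```

Each unlabeled node would then be assigned to the class with the largest absorption probability: t0 to class 0, t1 to class 1.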
- Jan 21 2014 stat.ME arXiv:1401.5002v2 The upper bounds on the coverage probabilities of confidence regions based on the blockwise empirical likelihood [Kitamura (1997)] and the nonstandard expansive empirical likelihood [Nordman et al. (2013)] methods for time series data are investigated by studying the probability of violation of the convex hull constraint. The large sample bounds are derived on the basis of the pivotal limit of the blockwise empirical log-likelihood ratio obtained under the fixed-b asymptotics, which has recently been shown to provide a more accurate approximation to the finite sample distribution than the conventional chi-square approximation. Our theoretical and numerical findings suggest that both the finite sample and large sample upper bounds on coverage probabilities are strictly less than one, and that the blockwise empirical likelihood confidence region can exhibit serious undercoverage when (i) the dimension of the moment conditions is moderate or large; (ii) the time series dependence is positively strong; or (iii) the block size is large relative to the sample size. A similar finite sample coverage problem occurs for the nonstandard expansive empirical likelihood. To alleviate the coverage bound problem, we propose to penalize both empirical likelihood methods by relaxing the convex hull constraint. Numerical simulations and a data illustration demonstrate the effectiveness of our proposed remedies in terms of delivering confidence sets with more accurate coverage.
- Although many convex relaxations of clustering have been proposed in the past decade, current formulations remain restricted to spherical Gaussian or discriminative models and are susceptible to imbalanced clusters. To address these shortcomings, we propose a new class of convex relaxations that can be flexibly applied to more general forms of Bregman divergence clustering. By basing these new formulations on normalized equivalence relations, we retain additional control over relaxation quality, which allows improvement in clustering quality. We furthermore develop optimization methods that improve scalability by exploiting recent implicit matrix norm methods. In practice, we find that the new formulations are able to efficiently produce tighter clusterings that improve the accuracy of state-of-the-art methods.
- Unsupervised deep learning is one of the most powerful representation learning techniques. Restricted Boltzmann machines, sparse coding, regularized auto-encoders, and convolutional neural networks are pioneering building blocks of deep learning. In this paper, we propose a new building block -- distributed random models. The proposed method is a special full implementation of the product of experts: (i) each expert owns multiple hidden units and different experts have different numbers of hidden units; (ii) the model of each expert is a k-centers clustering, whose centers are simply uniformly sampled examples, and whose output (i.e. the hidden units) is a sparse code in which only the similarity values from a few nearest neighbors are retained. The relationship between the pioneering building blocks, several notable research branches and the proposed method is analyzed. Experimental results show that the proposed deep model can learn better representations than deep belief networks, while training a much larger network in much less time.
- In the next decade, a number of experiments will attempt to determine the neutrino mass hierarchy. Feasibility studies for such experiments generally determine the expected value of $\Delta\chi^2$. As the hierarchy is a discrete choice, $\Delta\chi^2$ does not obey a one-degree-of-freedom $\chi^2$ distribution, and so the number of sigmas of confidence of the hierarchy determination is not the square root of the expected $\Delta\chi^2$. We present a simple formula for the sensitivity to the hierarchy that can be expected from the median experiment, as a function of the expected $\Delta\chi^2$.
- Apr 02 2013 stat.CO arXiv:1304.0151v1 In the following article we develop a particle filter for approximating Feynman-Kac models with indicator potentials. Examples of such models include approximate Bayesian computation (ABC) posteriors associated with hidden Markov models (HMMs) or rare-event problems. Such models require the use of advanced particle filter or Markov chain Monte Carlo (MCMC) algorithms, e.g. Jasra et al. (2012), to perform estimation. One drawback of existing particle filters is that they may 'collapse', in that the algorithm may terminate early due to the indicator potentials. In this article, using a special case of the locally adaptive particle filter in Lee et al. (2013), which is closely related to Le Gland & Oudjane (2004), we use an algorithm which can deal with this latter problem, whilst introducing a random cost per time step. This algorithm is investigated from a theoretical perspective and several results are given which help to validate the algorithms and to provide guidelines for their implementation. In addition, we show how this algorithm can be used within MCMC, using particle MCMC (Andrieu et al. 2010). Numerical examples are presented for ABC approximations of HMMs.
- Recently, deep-belief-network (DBN) based voice activity detection (VAD) has been proposed. It is powerful in fusing the advantages of multiple features and achieves state-of-the-art performance. However, the deep layers of the DBN-based VAD do not show an apparent superiority over the shallower layers. In this paper, we propose a denoising-deep-neural-network (DDNN) based VAD to address this problem. Specifically, we pre-train a deep neural network in a special unsupervised denoising greedy layer-wise mode, and then fine-tune the whole network in a supervised way by the common back-propagation algorithm. In the pre-training phase, we take the noisy speech signals as the visible layer and try to extract a new feature that minimizes the reconstruction cross-entropy loss between the noisy speech signals and their corresponding clean speech signals. Experimental results show that the proposed DDNN-based VAD not only outperforms the DBN-based VAD but also shows an apparent performance improvement of the deep layers over the shallower layers.
- We develop a Bayesian nonparametric model for reconstructing magnetic resonance images (MRI) from highly undersampled k-space data. We perform dictionary learning as part of the image reconstruction process. To this end, we use the beta process as a nonparametric dictionary learning prior for representing an image patch as a sparse combination of dictionary elements. The size of the dictionary and the patch-specific sparsity pattern are inferred from the data, in addition to other dictionary learning variables. Dictionary learning is performed directly on the compressed image, and so is tailored to the MRI being considered. In addition, we investigate a total variation penalty term in combination with the dictionary learning model, and show how the denoising property of dictionary learning removes dependence on regularization parameters in the noisy setting. We derive a stochastic optimization algorithm based on Markov chain Monte Carlo (MCMC) for the Bayesian model, and use the alternating direction method of multipliers (ADMM) for efficiently performing total variation minimization. We present empirical results on several MRIs, which show that the proposed regularization framework can improve reconstruction accuracy over other methods.
- Jan 15 2013 stat.CO arXiv:1301.2897v1 In this article we propose an improvement on the sequential updating and greedy search (SUGS) algorithm of Wang and Dunson for fast fitting of Dirichlet process mixture models. The SUGS algorithm provides a means for very fast approximate Bayesian inference for mixture data, which is particularly useful when data sets are so large that many standard Markov chain Monte Carlo (MCMC) algorithms cannot be applied efficiently, or take a prohibitively long time to converge. In particular, these ideas are used to initially interrogate the data and to refine models such that one can potentially apply exact data analysis later on. SUGS relies upon sequentially allocating data to clusters and proceeding with an update of the posterior on the subsequent allocations and parameters which assumes this allocation is correct. Our modification softens this approach by providing a probability distribution over allocations, at a similar computational cost; this approach has an interpretation as a variational Bayes procedure and hence we term it variational SUGS (VSUGS). It is shown in simulated examples that VSUGS can outperform the original SUGS algorithm, in terms of density estimation and classification, in many scenarios. In addition, we present a data analysis of flow cytometry data and SNP data via a three-class Dirichlet process mixture model, illustrating the apparent improvement over SUGS.
- A tree-based dictionary learning model is developed for joint analysis of imagery and associated text. The dictionary learning may be applied directly to the imagery from patches, or to general feature vectors extracted from patches or superpixels (using any existing method for image feature extraction). Each image is associated with a path through the tree (from root to a leaf), and each of the multiple patches in a given image is associated with one node in that path. Nodes near the tree root are shared between multiple paths, representing image characteristics that are common among different types of images. Moving toward the leaves, nodes become specialized, representing details in image classes. If available, words (text) are also jointly modeled, with a path-dependent probability over words. The tree structure is inferred via a nested Dirichlet process, and a retrospective stick-breaking sampler is used to infer the tree depth and width.
- Spatial-temporal linear models and the corresponding likelihood-based statistical inference are important tools for the analysis of spatial-temporal lattice data. In this paper, we study the asymptotic properties of maximum likelihood estimates under a general asymptotic framework for spatial-temporal linear models. We propose mild regularity conditions on the spatial-temporal weight matrices and derive the asymptotic properties (consistency and asymptotic normality) of the maximum likelihood estimates. A simulation study is conducted to examine the finite-sample properties of the maximum likelihood estimates.
- We demonstrate that almost all non-parametric dimensionality reduction methods can be expressed by a simple procedure: regularized loss minimization plus singular value truncation. By distinguishing the role of the loss and regularizer in such a process, we recover a factored perspective that reveals some gaps in the current literature. Beyond identifying a useful new loss for manifold unfolding, a key contribution is to derive new convex regularizers that combine distance maximization with rank reduction. These regularizers can be applied to any loss.
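The "regularized loss minimization plus singular value truncation" recipe above reduces, in its simplest instance, to projecting a matrix onto its top-$r$ singular subspace. A minimal sketch of that truncation step (the function name and toy data are illustrative assumptions):

```python
import numpy as np

def svd_truncate(X, r):
    """Keep only the top-r singular values of X, i.e. the closest
    rank-r matrix to X in Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s[r:] = 0.0  # zero out all but the r largest singular values
    return (U * s) @ Vt

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))
X2 = svd_truncate(X, r=2)
print(np.linalg.matrix_rank(X2))  # 2
```

In the factored perspective of the abstract, different non-parametric dimensionality reduction methods correspond to different losses and regularizers applied before this truncation.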
- In this paper, we present new results on using orthogonal matching pursuit (OMP) to solve the sparse approximation problem over redundant dictionaries in the complex case (i.e., complex measurement vector, complex dictionary and complex additive white Gaussian noise (CAWGN)). A sufficient condition under which OMP recovers the optimal representation of an exactly sparse signal in the complex case is proposed in both the noiseless and the bounded-Gaussian-noise settings. Similar to the exact recovery condition (ERC) results in the real case, we extend them to the complex case and derive the corresponding ERC. We leverage this theory to show that OMP succeeds for k-sparse signals from a class of complex dictionaries. In addition, an application with the geometrical theory of diffraction (GTD) model is presented for the complex case. Finally, simulation experiments illustrate the validity of the theoretical analysis.
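The OMP algorithm discussed above can be sketched as follows. This is a textbook real-valued sketch, not the paper's complex-case analysis: the complex case replaces `D.T` with the conjugate transpose, and the orthonormal demo dictionary sidesteps the ERC conditions that the paper establishes for redundant dictionaries.

```python
import numpy as np

def omp(D, y, k):
    """Greedily select k atoms of dictionary D to approximate y."""
    residual = y.copy()
    support = []
    for _ in range(k):
        # Pick the atom most correlated with the current residual
        # (use np.conj(D).T @ residual in the complex case).
        j = int(np.argmax(np.abs(D.T @ residual)))
        support.append(j)
        # Least-squares fit on the selected support, then update residual.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

rng = np.random.default_rng(0)
D, _ = np.linalg.qr(rng.normal(size=(20, 10)))  # orthonormal atoms
x_true = np.zeros(10)
x_true[[2, 7]] = [1.5, -2.0]
y = D @ x_true               # exactly 2-sparse, noiseless measurement
x_hat = omp(D, y, k=2)
print(np.round(x_hat[[2, 7]], 2))  # [ 1.5 -2. ]
```

With orthonormal atoms and a noiseless 2-sparse signal, two OMP iterations recover the support and coefficients exactly.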
- In this paper, we derive higher order Edgeworth expansions for the finite sample distributions of the subsampling-based t-statistic and the Wald statistic in the Gaussian location model under the so-called fixed-smoothing paradigm. In particular, we show that the error of the asymptotic approximation is of the order of the reciprocal of the sample size, and we obtain explicit forms for the leading error terms in the expansions. The results are used to justify the second-order correctness of a new bootstrap method, the Gaussian dependent bootstrap, in the context of the Gaussian location model.
- Apr 03 2012 stat.ME arXiv:1204.0286v2 In modeling spatial extremes, the dependence structure is classically inferred by assuming that block maxima derive from max-stable processes. Weather stations provide daily records rather than just block maxima. The point process approach for univariate extreme value analysis, which uses more historical data and is preferred by some practitioners, does not adapt easily to the spatial setting. We propose a two-step approach with a composite likelihood that utilizes site-wise daily records in addition to block maxima. The procedure separates the estimation of marginal parameters and dependence parameters into two steps. The first step estimates the marginal parameters with an independence likelihood from the point process approach using daily records. Given the marginal parameter estimates, the second step estimates the dependence parameters with a pairwise likelihood using block maxima. In a simulation study, the two-step approach was found to be more efficient than the pairwise likelihood approach using only block maxima. The method was applied to study the effect of El Niño-Southern Oscillation on extreme precipitation in California with maximum daily winter precipitation from 35 sites over 55 years. Using site-specific generalized extreme value models, the two-step approach led to more sites detected with the El Niño effect, narrower confidence intervals for return levels and tighter confidence regions for risk measures of jointly defined events.
- A Support Vector Method for multivariate performance measures was recently introduced by Joachims (2005). The underlying optimization problem is currently solved using cutting plane methods such as SVM-Perf and BMRM. One can show that these algorithms converge to an $\epsilon$ accurate solution in $O(1/(\lambda\epsilon))$ iterations, where $\lambda$ is the trade-off parameter between the regularizer and the loss function. We present a smoothing strategy for multivariate performance scores, in particular precision/recall break-even point and ROCArea. When combined with Nesterov's accelerated gradient algorithm, our smoothing strategy yields an optimization algorithm which converges to an $\epsilon$ accurate solution in $O(\min\{1/\epsilon, 1/\sqrt{\lambda\epsilon}\})$ iterations. Furthermore, the cost per iteration of our scheme is the same as that of SVM-Perf and BMRM. Empirical evaluation on a number of publicly available datasets shows that our method converges significantly faster than cutting plane methods without sacrificing generalization ability.
- We study model selection and model averaging in generalized additive partial linear models (GAPLMs). Polynomial spline is used to approximate nonparametric functions. The corresponding estimators of the linear parameters are shown to be asymptotically normal. We then develop a focused information criterion (FIC) and a frequentist model average (FMA) estimator on the basis of the quasi-likelihood principle and examine theoretical properties of the FIC and FMA. The major advantages of the proposed procedures over the existing ones are their computational expediency and theoretical reliability. Simulation experiments have provided evidence of the superiority of the proposed procedures. The approach is further applied to a real-world data example.
- A denoising technique based on noise invalidation is proposed. The adaptive approach derives a noise signature from the noise order statistics and utilizes the signature to denoise the data. The novelty of this approach is in providing a general-purpose denoising method, in the sense that it does not need to employ any particular assumption on the structure of the noise-free signal, such as data smoothness or sparsity of the coefficients. An advantage of the method is that it denoises the corrupted data in any complete basis transformation (orthogonal or non-orthogonal). Experimental results show that the proposed method, called Noise Invalidation Denoising (NIDe), outperforms existing denoising approaches in terms of mean square error (MSE).
- In this article we prove the existence and uniqueness for degenerate stochastic differential equations with Sobolev (possibly singular) drift and diffusion coefficients in a generalized sense. In particular, our result covers the classical DiPerna-Lions flows and, we also obtain the well-posedness for degenerate Fokker-Planck equations with irregular coefficients. Moreover, a large deviation principle of Freidlin-Wenzell type for this type of SDEs is established.
- Jul 30 2008 stat.AP arXiv:0807.4672v1 Quantitative Magnetic Resonance Imaging (qMRI) provides researchers with insight into pathological and physiological alterations of living tissue, with the help of which researchers hope to predict (local) therapeutic efficacy early and determine optimal treatment schedules. However, the analysis of qMRI has been limited to ad-hoc heuristic methods. Our research provides a powerful statistical framework for image analysis and sheds light on future localized adaptive treatment regimes tailored to the individual's response. We assume that, in an imperfect world, we observe only a blurred and noisy version of the underlying pathological/physiological changes via qMRI, due to measurement errors or unpredictable influences. We use a hidden Markov random field to model the spatial dependence in the data and develop a maximum likelihood approach via the Expectation--Maximization algorithm with stochastic variation. An important improvement over previous work is the assessment of variability in parameter estimation, which is the valid basis for statistical inference. More importantly, we focus on the expected changes rather than image segmentation. Our research has shown that the approach is powerful in both simulation studies and on a real dataset, while quite robust in the presence of some model assumption violations.