results for au:Chen_J in:stat

- We compare and contrast the statistical physics and quantum physics inspired approaches for unsupervised generative modeling of classical data. The two approaches represent probabilities of observed data using energy-based models and quantum states respectively.Classical and quantum information patterns of the target datasets therefore provide principled guidelines for structural design and learning in these two approaches. Taking the restricted Boltzmann machines (RBM) as an example, we analyze the information theoretical bounds of the two approaches. We verify our reasonings by comparing the performance of RBMs of various architectures on the standard MNIST datasets.
- In a number of cases, the Quantile Gaussian Process (QGP) has proven effective in emulating stochastic, univariate computer model output (Plumlee and Tuo, 2014). In this paper, we develop an approach that uses this emulation approach within a Bayesian model calibration framework to calibrate an agent-based model of an epidemic. In addition, this approach is extended to handle the multivariate nature of the model output, which gives a time series of the count of infected individuals. The basic modeling approach is adapted from Higdon et al. (2008), using a basis representation to capture the multivariate model output. The approach is motivated with an example taken from the 2015 Ebola Challenge workshop which simulated an ebola epidemic to evaluate methodology.
- Gaussian random fields (GRF) are a fundamental stochastic model for spatiotemporal data analysis. An essential ingredient of GRF is the covariance function that characterizes the joint Gaussian distribution of the field. Commonly used covariance functions give rise to fully dense and unstructured covariance matrices, for which required calculations are notoriously expensive to carry out for large data. In this work, we propose a construction of covariance functions that result in matrices with a hierarchical structure. Empowered by matrix algorithms that scale linearly with the matrix dimension, the hierarchical structure is proved to be efficient for a variety of random field computations, including sampling, kriging, and likelihood evaluation. Specifically, with $n$ scattered sites, sampling and likelihood evaluation has an $O(n)$ cost and kriging has an $O(\log n)$ cost after preprocessing, particularly favorable for the kriging of an extremely large number of sites (e.g., predicting on more sites than observed). We demonstrate comprehensive numerical experiments to show the use of the constructed covariance functions and their appealing computation time. Numerical examples on a laptop include simulated data of size up to one million, as well as a climate data product with over two million observations.
- Kernel PCA is a widely used nonlinear dimension reduction technique in machine learning, but storing the kernel matrix is notoriously challenging when the sample size is large. Inspired by [YPCC16], where the idea of partial matrix sampling followed by nonconvex optimization is proposed for matrix completion and robust PCA, we apply a similar approach to memory-efficient Kernel PCA. In theory, with no assumptions on the kernel matrix in terms of eigenvalues or eigenvectors, we established a model-free theory for the low-rank approximation based on any local minimum of the proposed objective function. As interesting byproducts, when the underlying positive semidefinite matrix is assumed to be low-rank and highly structured, corollaries of our main theorem improve the state-of-the-art results [GLM16, GJZ17] for nonconvex matrix completion with no spurious local minima. Numerical experiments also show that our approach is competitive in terms of approximation accuracy compared to the well-known Nyström algorithm for Kernel PCA.
- We propose a new localized inference algorithm for answering marginalization queries in large graphical models with the correlation decay property. Given a query variable and a large graphical model, we define a much smaller model in a local region around the query variable in the target model so that the marginal distribution of the query variable can be accurately approximated. We introduce two approximation error bounds based on the Dobrushin's comparison theorem and apply our bounds to derive a greedy expansion algorithm that efficiently guides the selection of neighbor nodes for localized inference. We verify our theoretical bounds on various datasets and demonstrate that our localized inference algorithm can provide fast and accurate approximation for large graphical models.
- Graph convolutional networks (GCNs) are powerful deep neural networks for graph-structured data. However, GCN computes nodes' representation recursively from their neighbors, making the receptive field size grow exponentially with the number of layers. Previous attempts on reducing the receptive field size by subsampling neighbors do not have any convergence guarantee, and their receptive field size per node is still in the order of hundreds. In this paper, we develop a preprocessing strategy and two control variate based algorithms to further reduce the receptive field size. Our algorithms are guaranteed to converge to GCN's local optimum regardless of the neighbor sampling size. Empirical results show that our algorithms have a similar convergence speed per epoch with the exact algorithm even using only two neighbors per node. The time consumption of our algorithm on the Reddit dataset is only one fifth of previous neighbor sampling algorithms.
- Principal component analysis (PCA) has well-documented merits for data extraction and dimensionality reduction. PCA deals with a single dataset at a time, and it is challenged when it comes to analyzing multiple datasets. Yet in certain setups, one wishes to extract the most significant information of one dataset relative to other datasets. Specifically, the interest may be on identifying, namely extracting features that are specific to a single target dataset but not the others. This paper develops a novel approach for such so-termed discriminative data analysis, and establishes its optimality in the least-squares (LS) sense under suitable data modeling assumptions. The criterion reveals linear combinations of variables by maximizing the ratio of the variance of the target data to that of the remainders. The novel approach solves a generalized eigenvalue problem by performing SVD just once. Numerical tests using synthetic and real datasets showcase the merits of the proposed approach relative to its competing alternatives.
- Oct 24 2017 stat.ML arXiv:1710.07804v1In this paper, we design and analyze a new zeroth-order online algorithm, namely, the zeroth-order online alternating direction method of multipliers (ZOO-ADMM), which enjoys dual advantages of being gradient-free operation and employing the ADMM to accommodate complex structured regularizers. Compared to the first-order gradient-based online algorithm, we show that ZOO-ADMM requires $\sqrt{m}$ times more iterations, leading to a convergence rate of $O(\sqrt{m}/\sqrt{T})$, where $m$ is the number of optimization variables, and $T$ is the number of iterations. To accelerate ZOO-ADMM, we propose two minibatch strategies: gradient sample averaging and observation averaging, resulting in an improved convergence rate of $O(\sqrt{1+q^{-1}m}/\sqrt{T})$, where $q$ is the minibatch size. In addition to convergence analysis, we also demonstrate ZOO-ADMM to applications in signal processing, statistics, and machine learning.
- Identifying arbitrary topologies of power networks in real time is a computationally hard problem due to the number of hypotheses that grows exponentially with the network size. A new "Learning-to-Infer" variational inference method is developed for efficient inference of every line status in the network. Optimizing the variational model is transformed to and solved as a discriminative learning problem based on Monte Carlo samples generated with power flow simulations. A major advantage of the developed Learning-to-Infer method is that the labeled data used for training can be generated in an arbitrarily large amount fast and at very little cost. As a result, the power of offline training is fully exploited to learn very complex classifiers for effective real-time topology identification. The proposed methods are evaluated in the IEEE 30, 118 and 300 bus systems. Excellent performance in identifying arbitrary power network topologies in real time is achieved even with relatively simple variational models and a reasonably small amount of data.
- We propose a top-down algorithm for multiple testing on directed acyclic graphs (DAGs), where nodes represent hypotheses and edges specify a partial ordering in which hypotheses must be tested. The procedure is guaranteed to reject a sub-DAG with bounded false discovery rate (FDR) while satisfying the logical constraint that a rejected node's parents must also be rejected. It is designed for sequential testing settings, when the DAG structure is known a priori, but the p-values are obtained selectively (such as sequential conduction of experiments), but the algorithm is also applicable in non-sequential settings when all p-values can be calculated in advance (such as variable/model selection). Our DAGGER algorithm, shorthand for Greedily Evolving Rejections on DAGs, allows for independence, positive or arbitrary dependence of the p-values, and is guaranteed to work on two different types of DAGs: (a) intersection DAGs in which all nodes are intersection hypotheses, with parents being supersets of children, or (b) general DAGs in which all nodes may be elementary hypotheses. The DAGGER procedure has the appealing property that it specializes to known algorithms in the special cases of trees and line graphs, and simplifies to the classic Benjamini-Hochberg procedure when the DAG has no edges. We explore the empirical performance of DAGGER using simulations, as well as a real dataset corresponding to a gene ontology DAG, showing that it performs favorably in terms of time and power.
- In this paper we introduce ZhuSuan, a python probabilistic programming library for Bayesian deep learning, which conjoins the complimentary advantages of Bayesian methods and deep learning. ZhuSuan is built upon Tensorflow. Unlike existing deep learning libraries, which are mainly designed for deterministic neural networks and supervised tasks, ZhuSuan is featured for its deep root into Bayesian inference, thus supporting various kinds of probabilistic models, including both the traditional hierarchical Bayesian models and recent deep generative models. We use running examples to illustrate the probabilistic programming on ZhuSuan, including Bayesian logistic regression, variational auto-encoders, deep sigmoid belief networks and Bayesian recurrent neural networks.
- In 2015, Barber and Candes introduced a new variable selection procedure called the knockoff filter to control the false discovery rate (FDR) and prove that this method achieves exact FDR control. Inspired by the work of Barber and Candes (2015), we propose and analyze a pseudo-knockoff filter that inherits some advantages of the original knockoff filter and has more flexibility in constructing its knockoff matrix. Although we have not been able to obtain exact FDR control of the pseudo knockoff filter, we show that it satisfies an expectation inequality that offers some insight into FDR control. Moreover, we provide some partial analysis of the pseudo knockoff filter for the half Lasso and the least squares statistics. Our analysis indicates that the inverse of the covariance matrix of the feature matrix plays an important role in designing and analyzing the pseudo knockoff filter. Our preliminary numerical experiments show that the pseudo knockoff filter with the half Lasso statistic has FDR control. Moreover, our numerical experiments show that the pseudo-knockoff filter could offer more power than the original knockoff filter with the OMP or Lasso Path statistic when the features are correlated and non-sparse.
- We present a unified framework to analyze the global convergence of Langevin dynamics based algorithms for nonconvex finite-sum optimization with $n$ component functions. At the core of our analysis is a new decomposition scheme of the optimization error, under which we directly analyze the ergodicity of the numerical approximations of Langevin dynamics and prove sharp convergence rates. We establish the first global convergence guarantee of gradient Langevin dynamics (GLD) with iteration complexity $O\big(1/\epsilon \cdot\log(1/\epsilon)\big)$. In addition, we improve the convergence rate of stochastic gradient Langevin dynamics (SGLD) to the "almost minimizer", which does not depend on the undesirable uniform spectral gap introduced in previous studies. Furthermore, we for the first time prove the global convergence guarantee of variance reduced stochastic gradient Langevin dynamics (VR-SGLD) with iteration complexity $O\big(m/(B\epsilon^3)\cdot\log(1/\epsilon)\big)$, where $B$ is the mini-batch size and $m$ is the length of the inner loop. We show that the gradient complexity of VR-SGLD is $O\big(n^{1/2}/\epsilon^{3/2}\cdot\log(1/\epsilon)\big)$, which outperforms $O\big(n/\epsilon\cdot\log(1/\epsilon)\big)$ gradient complexity of GLD, when the number of component functions satisfies $n \geq 1/\epsilon$. Our theoretical analysis shed some light on using Langevin dynamics based algorithms for nonconvex optimization with provable guarantees.
- We study testing high-dimensional covariance matrices when data exhibit heteroskedasticity. The observations are modeled as $\mathbf Y_i=\omega_i\mathbf Z_i$, where $\mathbf Z_i$'s are i.i.d. $p$-dimensional random vectors with mean $\textbf{0}$ and covariance $\boldsymbol{\Sigma}$, and $\omega_i$'s are random scalars reflecting heteroskedasticity. The model is an extension of the elliptical distribution, and accommodates several stylized facts of real data including heteroskedasticity, heavy-tailedness, asymmetry, etc. We aim to test $H_{0}: \boldsymbol{\Sigma}\propto\boldsymbol{\Sigma}_0$, in the high-dimensional setting where both the dimension $p$ and the sample size $n$ grow to infinity proportionally. We remove the heteroskedasticity by self-normalizing the observations, and establish a CLT for the linear spectral statistic (LSS) of $\widetilde{\mathbf S}_n:=\frac{p}{n}\sum_{i=1}^n \mathbf Y_i\mathbf Y_i^T/|\mathbf Y_i|^2=\frac{p}{n}\sum_{i=1}^n \mathbf Z_i\mathbf Z_i^T/|\mathbf Z_i|^2$. The CLT is different from the existing ones for the LSS of the usual sample covariance matrix $\mathbf S_n:=\frac{1}{n}\sum_{i=1}^n \mathbf Z_i\mathbf Z_i^T $ (Bai and Silverstein[Ann. Probab. 32(2004) 553--605], Najim and Yao [Ann. Appl. Probab. 26(2016) 1837--1887]). Our tests based on the new CLT neither assume a specific parametric distribution nor involve the fourth moment of $\mathbf Z_i$. Numerical studies show that our tests work well even when $\mathbf Z_i$'s are heavy-tailed.
- We propose a framework for feature selection that employs kernel-based measures of independence to find a subset of covariates that is maximally predictive of the response. Building on past work in kernel dimension reduction, we formulate our approach as a constrained optimization problem involving the trace of the conditional covariance operator, and additionally provide some consistency results. We then demonstrate on a variety of synthetic and real data sets that our method compares favorably with other state-of-the-art algorithms.
- We present an efficient score statistic to detect the emergence of a spatially and temporally correlated signal, which is called the $\textsf{S}^3T$ statistic. The signal may cause a mean shift, and (or) a change in the covariance structure. The score statistic can capture both spatial and temporal change and hence is particularly powerful in detecting weak signals. Our score method is computationally efficient and statistically powerful. The main theoretical contribution is an accurate analytical approximation on the false alarm rate of the detection procedure, which can be used to calibrate a threshold analytically. Simulated and real data examples demonstrate the good performance of our procedure.
- In many applications, we need to study a linear regression model that consists of a response variable and a large number of potential explanatory variables and determine which variables are truly associated with the response. In 2015, Barber and Candes introduced a new variable selection procedure called the knockoff filter to control the false discovery rate (FDR) and proved that this method achieves exact FDR control. In this paper, we provide some analysis of the knockoff filter and its variants. Based on our analysis, we propose a PCA prototype group selection filter that has exact group FDR control and several advantages over existing group selection methods for strongly correlated features. Another contribution is that we propose a new noise estimator that can be incorporated into the knockoff statistic from a penalized method without violating the exchangeability property. Our analysis also reveals that some knockoff statistics, including the Lasso path and the marginal correlation statistics, suffer from the alternating sign effect. To overcome this deficiency, we introduce the notion of a good statistic and propose several alternative statistics that take advantage of the good statistic property. Finally, we present a number of numerical experiments to demonstrate the effectiveness of our methods and confirm our analysis.
- Jun 06 2017 stat.ME arXiv:1706.01426v1Learning with Reproducing Kernel Hilbert Spaces (RKHS) has been widely used in many scientific disciplines. Because a RKHS can be very flexible, it is common to impose a regularization term in the optimization to prevent overfitting. Standard RKHS learning employs the squared norm penalty of the learning function. Despite its success, many challenges remain. In particular, one cannot directly use the squared norm penalty for variable selection or data extraction. Therefore, when there exists noise predictors, or the underlying function has a sparse representation in the dual space, the performance of standard RKHS learning can be suboptimal. In the literature,work has been proposed on how to perform variable selection in RKHS learning, and a data sparsity constraint was considered for data extraction. However, how to learn in a RKHS with both variable selection and data extraction simultaneously remains unclear. In this paper, we propose a unified RKHS learning method, namely, DOuble Sparsity Kernel (DOSK) learning, to overcome this challenge. An efficient algorithm is provided to solve the corresponding optimization problem. We prove that under certain conditions, our new method can asymptotically achieve variable selection consistency. Simulated and real data results demonstrate that DOSK is highly competitive among existing approaches for RKHS learning.
- This paper deals with finding an $n$-dimensional solution $x$ to a system of quadratic equations of the form $y_i=|\langle{a}_i,x\rangle|^2$ for $1\le i \le m$, which is also known as phase retrieval and is NP-hard in general. We put forth a novel procedure for minimizing the amplitude-based least-squares empirical loss, that starts with a weighted maximal correlation initialization obtainable with a few power or Lanczos iterations, followed by successive refinements based upon a sequence of iteratively reweighted (generalized) gradient iterations. The two (both the initialization and gradient flow) stages distinguish themselves from prior contributions by the inclusion of a fresh (re)weighting regularization technique. The overall algorithm is conceptually simple, numerically scalable, and easy-to-implement. For certain random measurement models, the novel procedure is shown capable of finding the true solution $x$ in time proportional to reading the data $\{(a_i;y_i)\}_{1\le i \le m}$. This holds with high probability and without extra assumption on the signal $x$ to be recovered, provided that the number $m$ of equations is some constant $c>0$ times the number $n$ of unknowns in the signal vector, namely, $m>cn$. Empirically, the upshots of this contribution are: i) (almost) $100\%$ perfect signal recovery in the high-dimensional (say e.g., $n\ge 2,000$) regime given only an information-theoretic limit number of noiseless equations, namely, $m=2n-1$ in the real-valued Gaussian case; and, ii) (nearly) optimal statistical accuracy in the presence of additive noise of bounded support. Finally, substantial numerical tests using both synthetic data and real images corroborate markedly improved signal recovery performance and computational efficiency of our novel procedure relative to state-of-the-art approaches.
- May 30 2017 stat.ME arXiv:1705.10063v1Sample surveys are widely used to obtain information about totals, means, medians, and other parameters of finite populations. In many applications, similar information is desired for subpopulations such as individuals in specific geographic areas and socio-demographic groups. When the surveys are conducted at national or similarly high levels, a probability sampling can result in just a few sampling units from many unplanned subpopulations at the design stage. Cost considerations may also lead to low sample sizes from individual small areas. Estimating the parameters of these subpopulations with satisfactory precision and evaluating their accuracy are serious challenges for statisticians. To overcome the difficulties, statisticians resort to pooling information across the small areas via suitable model assumptions, administrative archives, and census data. In this paper, we develop an array of small area quantile estimators. The novelty is the introduction of a semiparametric density ratio model for the error distribution in the unit-level nested error regression model. In contrast, the existing methods are usually most effective when the response values are jointly normal. We also propose a resampling procedure for estimating the mean square errors of these estimators. Simulation results indicate that the new methods have superior performance when the population distributions are skewed and remain competitive otherwise.
- For various applications, the relations between the dependent and independent variables are highly nonlinear. Consequently, for large scale complex problems, neural networks and regression trees are commonly preferred over linear models such as Lasso. This work proposes learning the feature nonlinearities by binning feature values and finding the best fit in each quantile using non-convex regularized linear regression. The algorithm first captures the dependence between neighboring quantiles by enforcing smoothness via piecewise-constant/linear approximation and then selects a sparse subset of good features. We prove that the proposed algorithm is statistically and computationally efficient. In particular, it achieves linear rate of convergence while requiring near-minimal number of samples. Evaluations on synthetic and real datasets demonstrate that algorithm is competitive with current state-of-the-art and accurately learns feature nonlinearities. Finally, we explore an interesting connection between the binning stage of our algorithm and sparse Johnson-Lindenstrauss matrices.
- In this chapter, we analyze nonlinear filtering problems in distributed environments, e.g., sensor networks or peer-to-peer protocols. In these scenarios, the agents in the environment receive measurements in a streaming fashion, and they are required to estimate a common (nonlinear) model by alternating local computations and communications with their neighbors. We focus on the important distinction between single-task problems, where the underlying model is common to all agents, and multitask problems, where each agent might converge to a different model due to, e.g., spatial dependencies or other factors. Currently, most of the literature on distributed learning in the nonlinear case has focused on the single-task case, which may be a strong limitation in real-world scenarios. After introducing the problem and reviewing the existing approaches, we describe a simple kernel-based algorithm tailored for the multitask case. We evaluate the proposal on a simulated benchmark task, and we conclude by detailing currently open problems and lines of research.
- We consider the phase retrieval problem of recovering the unknown signal from the magnitude-only measurements, where the measurements can be contaminated by both sparse arbitrary corruption and bounded random noise. We propose a new nonconvex algorithm for robust phase retrieval, namely Robust Wirtinger Flow, to jointly estimate the unknown signal and the sparse corruption. We show that our proposed algorithm is guaranteed to converge linearly to the unknown true signal up to a minimax optimal statistical precision in such a challenging setting. Compared with existing robust phase retrieval methods, we improved the statistical error rate by a factor of $\sqrt{n/m}$ where $n$ is the dimension of the signal and $m$ is the sample size, provided a refined characterization of the corruption fraction requirement, and relaxed the lower bound condition on the number of corruption. In the noise-free case, our algorithm converges to the unknown signal at a linear rate and achieves optimal sample complexity up to a logarithm factor. Thorough experiments on both synthetic and real datasets corroborate our theory.
- Multi-parameter one-sided hypothesis test problems arise naturally in many applications. We are particularly interested in effective tests for monitoring multiple quality indices in forestry products. Our search reveals that there are many effective statistical methods in the literature for normal data, and that they can easily be adapted for non-normal data. We find that the beautiful likelihood ratio test is unsatisfactory, because in order to control the size, it must cope with the least favorable distributions at the cost of power. In this paper, we find a novel way to slightly ease the size control, obtaining a much more powerful test. Simulation confirms that the new test retains good control of the type I error and is markedly more powerful than the likelihood ratio test as well as many competitors based on normal data. The new method performs well in the context of monitoring multiple quality indices.
- Policy evaluation is a crucial step in many reinforcement-learning procedures, which estimates a value function that predicts states' long-term value under a given policy. In this paper, we focus on policy evaluation with linear function approximation over a fixed dataset. We first transform the empirical policy evaluation problem into a (quadratic) convex-concave saddle point problem, and then present a primal-dual batch gradient method, as well as two stochastic variance reduction methods for solving the problem. These algorithms scale linearly in both sample size and feature dimension. Moreover, they achieve linear convergence even when the saddle-point problem has only strong concavity in the dual variables but no strong convexity in the primal variables. Numerical experiments on benchmark problems demonstrate the effectiveness of our methods.
- Nested Chinese Restaurant Process (nCRP) topic models are powerful nonparametric Bayesian methods to extract a topic hierarchy from a given text corpus, where the hierarchical structure is automatically determined by the data. Hierarchical Latent Dirichlet Allocation (hLDA) is a popular instance of nCRP topic models. However, hLDA has only been evaluated at small scale, because the existing collapsed Gibbs sampling and instantiated weight variational inference algorithms either are not scalable or sacrifice inference quality with mean-field assumptions. Moreover, an efficient distributed implementation of the data structures, such as dynamically growing count matrices and trees, is challenging. In this paper, we propose a novel partially collapsed Gibbs sampling (PCGS) algorithm, which combines the advantages of collapsed and instantiated weight algorithms to achieve good scalability as well as high model quality. An initialization strategy is presented to further improve the model quality. Finally, we propose an efficient distributed implementation of PCGS through vectorization, pre-processing, and a careful design of the concurrent data structures and communication strategy. Empirical studies show that our algorithm is 111 times more efficient than the previous open-source implementation for hLDA, with comparable or even better model quality. Our distributed implementation can extract 1,722 topics from a 131-million-document corpus with 28 billion tokens, which is 4-5 orders of magnitude larger than the previous largest corpus, with 50 machines in 7 hours.
- We test three common information criteria (IC) for selecting the order of a Hawkes process with an intensity kernel that can be expressed as a mixture of exponential terms. These processes find application in high-frequency financial data modelling. The information criteria are Akaike's information criterion (AIC), the Bayesian information criterion (BIC) and the Hannan-Quinn criterion (HQ). Since we work with simulated data, we are able to measure the performance of model selection by the success rate of the IC in selecting the model that was used to generate the data. In particular, we are interested in the relation between correct model selection and underlying sample size. The analysis includes realistic sample sizes and parameter sets from recent literature where parameters were estimated using empirical financial intra-day data. We compare our results to theoretical predictions and similar empirical findings on the asymptotic distribution of model selection for consistent and inconsistent IC.
- Feb 17 2017 stat.ME arXiv:1702.04755v1Precision medicine is an emerging scientific topic for disease treatment and prevention that takes into account individual patient characteristics. It is an important direction for clinical research, and many statistical methods have been recently proposed. One of the primary goals of precision medicine is to obtain an optimal individual treatment rule (ITR), which can help make decisions on treatment selection according to each patient's specific characteristics. Recently, outcome weighted learning (OWL) has been proposed to estimate such an optimal ITR in a binary treatment setting by maximizing the expected clinical outcome. However, for ordinal treatment settings, such as individualized dose finding, it is unclear how to use OWL. In this paper, we propose a new technique for estimating ITR with ordinal treatments. In particular, we propose a data duplication technique with a piecewise convex loss function. We establish Fisher consistency for the resulting estimated ITR under certain conditions, and obtain the convergence and risk bound properties. Simulated examples and two applications to datasets from an irritable bowel problem and a type 2 diabetes mellitus observational study demonstrate the highly competitive performance of the proposed method compared to existing alternatives.
- Online learning with streaming data in a distributed and collaborative manner can be useful in a wide range of applications. This topic has been receiving considerable attention in recent years with emphasis on both single-task and multitask scenarios. In single-task adaptation, agents cooperate to track an objective of common interest, while in multitask adaptation agents track multiple objectives simultaneously. Regularization is one useful technique to promote and exploit similarity among tasks in the latter scenario. This work examines an alternative way to model relations among tasks by assuming that they all share a common latent feature representation. As a result, a new multitask learning formulation is presented and algorithms are developed for its solution in a distributed online manner. We present a unified framework to analyze the mean-square-error performance of the adaptive strategies, and conduct simulations to illustrate the theoretical findings and potential applications.
- Restricted Boltzmann machine (RBM) is one of the fundamental building blocks of deep learning. RBM finds wide applications in dimensional reduction, feature extraction, and recommender systems via modeling the probability distributions of a variety of input data including natural images, speech signals, and customer ratings, etc. We build a bridge between RBM and tensor network states (TNS) widely used in quantum many-body physics research. We devise efficient algorithms to translate an RBM into the commonly used TNS. Conversely, we give sufficient and necessary conditions to determine whether a TNS can be transformed into an RBM of given architectures. Revealing these general and constructive connections can cross-fertilize both deep learning and quantum-many body physics. Notably, by exploiting the entanglement entropy bound of TNS, we can rigorously quantify the expressive power of RBM on complex datasets. Insights into TNS and its entanglement capacity can guide the design of more powerful deep learning architectures. On the other hand, RBM can represent quantum many-body states with fewer parameters compared to TNS, which may allow more efficient classical simulations.
- Jan 05 2017 stat.AP arXiv:1701.00900v1Cooperative geolocation has attracted significant research interests in recent years. A large number of localization algorithms rely on the availability of statistical knowledge of measurement errors, which is often difficult to obtain in practice. Compared with the statistical knowledge of measurement errors, it can often be easier to obtain the measurement error bound. This work investigates a localization problem assuming unknown measurement error distribution except for a bound on the error. We first formulate this localization problem as an optimization problem to minimize the worst-case estimation error, which is shown to be a non-convex optimization problem. Then, relaxation is applied to transform it into a convex one. Furthermore, we propose a distributed algorithm to solve the problem, which will converge in a few iterations. Simulation results show that the proposed algorithms are more robust to large measurement errors than existing algorithms in the literature. Geometrical analysis providing additional insights is also provided.
- Dec 22 2016 stat.ME arXiv:1612.07072v1This article addresses the problem of efficient Bayesian inference in dynamic systems using particle methods and makes a number of contributions. First, we develop a correlated pseudo-marginal (CPM) approach for Bayesian inference in state space (SS) models that is based on filtering the disturbances, rather than the states. This approach is useful when the state transition density is intractable or inefficient to compute, and also when the dimension of the disturbance is lower than the dimension of the state. Second, we propose a block pseudo-marginal (BPM) method that uses as the estimate of the likelihood the average of G independent unbiased estimates of the likelihood. We associate a set of underlying uniform of standard normal random numbers used to construct each of the individual unbiased likelihood estimates and then use component-wise Markov Chain Monte Carlo to update the parameter vector jointly with one set of these random numbers at a time. This induces a correlation of approximately 1-1/G between the logs of the estimated likelihood at the proposed and current values of the model parameters. Third, we show for some non-stationary state space models that the BPM approach is much more efficient than the CPM approach, because it is difficult to translate the high correlation in the underlying random numbers to high correlation between the logs of the likelihood estimates. Although our focus has been on applying the BPM method to state space models, our results and approach can be used in a wide range of applications of the PM method, such as panel data models, subsampling problems and approximate Bayesian computation.
- A novel approach termed \emphstochastic truncated amplitude flow (STAF) is developed to reconstruct an unknown $n$-dimensional real-/complex-valued signal $\bm{x}$ from $m$ `phaseless' quadratic equations of the form $\psi_i=|\langle\bm{a}_i,\bm{x}\rangle|$. This problem, also known as phase retrieval from magnitude-only information, is \emphNP-hard in general. Adopting an amplitude-based nonconvex formulation, STAF leads to an iterative solver comprising two stages: s1) Orthogonality-promoting initialization through a stochastic variance reduced gradient algorithm; and, s2) A series of iterative refinements of the initialization using stochastic truncated gradient iterations. Both stages involve a single equation per iteration, thus rendering STAF a simple, scalable, and fast approach amenable to large-scale implementations that is useful when $n$ is large. When $\{\bm{a}_i\}_{i=1}^m$ are independent Gaussian, STAF provably recovers exactly any $\bm{x}\in\mathbb{R}^n$ exponentially fast based on order of $n$ quadratic equations. STAF is also robust in the presence of additive noise of bounded support. Simulated tests involving real Gaussian $\{\bm{a}_i\}$ vectors demonstrate that STAF empirically reconstructs any $\bm{x}\in\mathbb{R}^n$ exactly from about $2.3n$ magnitude-only measurements, outperforming state-of-the-art approaches and narrowing the gap from the information-theoretic number of equations $m=2n-1$. Extensive experiments using synthetic data and real images corroborate markedly improved performance of STAF over existing alternatives.
- Positive-definite kernel functions are fundamental elements of kernel methods and Gaussian processes. A well-known construction of such functions comes from Bochner's characterization, which connects a positive-definite function with a probability distribution. Another construction, which appears to have attracted less attention, is Polya's criterion that characterizes a subset of these functions. In this paper, we study the latter characterization and derive a number of novel kernels little known previously. In the context of large-scale kernel machines, Rahimi and Recht (2007) proposed a random feature map (random Fourier) that approximates a kernel function, through independent sampling of the probability distribution in Bochner's characterization. The authors also suggested another feature map (random binning), which, although not explicitly stated, comes from Polya's characterization. We show that with the same number of random samples, the random binning map results in an Euclidean inner product closer to the kernel than does the random Fourier map. The superiority of the random binning map is confirmed empirically through regressions and classifications in the reproducing kernel Hilbert space.
- Oct 20 2016 stat.ME arXiv:1610.05809v1Factors such as climate change, forest fire and plague of insects, lead to concerns on the mechanical strength of plantation materials. To address such concerns, these products must be closely monitored. This leads to the need of updating lumber quality monitoring procedures in American Society for Testing and Materials (ASTM) Standard D1990 (adopted in 1991) from time to time. A key component of monitoring is an effective method for detecting the change in lower percentiles of the solid lumber strength based on multiple samples. In a recent study by Verrill et al.\ (2015), eight statistical tests proposed by wood scientists were examined thoroughly based on real and simulated data sets. These tests are found unsatisfactory in differing aspects such as seriously inflated false alarm rate when observations are clustered, suboptimal power properties, or having inconvenient ad hoc rejection regions. A contributing factor behind suboptimal performance is that most of these tests are not developed to detect the change in quantiles. In this paper, we use a nonparametric random effects model to handle the within cluster correlations, composite empirical likelihood to avoid explicit modelling of the correlations structure, and a density ratio model to combine the information from multiple samples. In addition, we propose a cluster-based bootstrapping procedure to construct the monitoring test on quantiles which satisfactorily controls the type I error in the presence of within cluster correlation. The performance of the test is examined through simulation experiments and a real world example. The new method is generally applicable, not confined to the motivating example.
- Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images. Applications require LDA to handle both large datasets and a large number of topics. Though distributed CPU systems have been used, GPU-based systems have emerged as a promising alternative because of the high computational power and memory bandwidth of GPUs. However, existing GPU-based LDA systems cannot support a large number of topics because they use algorithms on dense data structures whose time and space complexity is linear to the number of topics. In this paper, we propose SaberLDA, a GPU-based LDA system that implements a sparsity-aware algorithm to achieve sublinear time complexity and scales well to learn a large number of topics. To address the challenges introduced by sparsity, we propose a novel data layout, a new warp-based sampling kernel, and an efficient sparse count matrix updating algorithm that improves locality, makes efficient utilization of GPU warps, and reduces memory consumption. Experiments show that SaberLDA can learn from billions-token-scale data with up to 10,000 topics, which is almost two orders of magnitude larger than that of the previous GPU-based systems. With a single GPU card, SaberLDA is able to learn 10,000 topics from a dataset of billions of tokens in a few hours, which is only achievable with clusters with tens of machines before.
- There is a growing interest in joint multi-subject fMRI analysis. The challenge of such analysis comes from inherent anatomical and functional variability across subjects. One approach to resolving this is a shared response factor model. This assumes a shared and time synchronized stimulus across subjects. Such a model can often identify shared information, but it may not be able to pinpoint with high resolution the spatial location of this information. In this work, we examine a searchlight based shared response model to identify shared information in small contiguous regions (searchlights) across the whole brain. Validation using classification tasks demonstrates that we can pinpoint informative local regions.
- Finding the most effective way to aggregate multi-subject fMRI data is a long-standing and challenging problem. It is of increasing interest in contemporary fMRI studies of human cognition due to the scarcity of data per subject and the variability of brain anatomy and functional response across subjects. Recent work on latent factor models shows promising results in this task but this approach does not preserve spatial locality in the brain. We examine two ways to combine the ideas of a factor model and a searchlight based analysis to aggregate multi-subject fMRI data while preserving spatial locality. We first do this directly by combining a recent factor method known as a shared response model with searchlight analysis. Then we design a multi-view convolutional autoencoder for the same task. Both approaches preserve spatial locality and have competitive or better performance compared with standard searchlight analysis and the shared response model applied across the whole brain. We also report a system design to handle the computational challenge of training the convolutional autoencoder.
- We propose a novel class of kernels to alleviate the high computational cost of large-scale nonparametric learning with kernel methods. The proposed kernel is defined based on a hierarchical partitioning of the underlying data domain, where the Nyström method (a globally low-rank approximation) is married with a locally lossless approximation in a hierarchical fashion. The kernel maintains (strict) positive-definiteness. The corresponding kernel matrix admits a recursively off-diagonal low-rank structure, which allows for fast linear algebra computations. Suppressing the factor of data dimension, the memory and arithmetic complexities for training a regression or a classifier are reduced from $O(n^2)$ and $O(n^3)$ to $O(nr)$ and $O(nr^2)$, respectively, where $n$ is the number of training examples and $r$ is the rank on each level of the hierarchy. Although other randomized approximate kernels entail a similar complexity, empirical results show that the proposed kernel achieves a matching performance with a smaller $r$. We demonstrate comprehensive experiments to show the effective use of the proposed kernel on data sizes up to the order of millions.
- The large-sample properties of likelihood-based statistical inference under mixture models have received much attention from statisticians. Although the consistency of the nonparametric MLE is regarded as a standard conclusion, many researchers ignore the precise conditions required on the mixture model. An incorrect claim of consistency can lead to false conclusions even if the mixture model under investigation seems well behaved. Under a finite normal mixture model, for instance, the consistency of the plain MLE is often erroneously assumed in spite of recent research breakthroughs. This paper streamlines the consistency results for the nonparametric MLE in general, and in particular for the penalized MLE under finite normal mixture models.
- Jun 03 2016 stat.ML arXiv:1606.00832v1We propose a nonconvex estimator for joint multivariate regression and precision matrix estimation in the high dimensional regime, under sparsity constraints. A gradient descent algorithm with hard thresholding is developed to solve the nonconvex estimator, and it attains a linear rate of convergence to the true regression coefficients and precision matrix simultaneously, up to the statistical error. Compared with existing methods along this line of research, which have little theoretical guarantee, the proposed algorithm not only is computationally much more efficient with provable convergence guarantee, but also attains the optimal finite sample statistical rate up to a logarithmic factor. Thorough experiments on both synthetic and real datasets back up our theory.
- Feb 22 2016 stat.ML arXiv:1602.06049v1Dynamic topic models (DTMs) are very effective in discovering topics and capturing their evolution trends in time series data. To do posterior inference of DTMs, existing methods are all batch algorithms that scan the full dataset before each update of the model and make inexact variational approximations with mean-field assumptions. Due to a lack of a more scalable inference algorithm, despite the usefulness, DTMs have not captured large topic dynamics. This paper fills this research void, and presents a fast and parallelizable inference algorithm using Gibbs Sampling with Stochastic Gradient Langevin Dynamics that does not make any unwarranted assumptions. We also present a Metropolis-Hastings based $O(1)$ sampler for topic assignments for each word token. In a distributed environment, our algorithm requires very little communication between workers during sampling (almost embarrassingly parallel) and scales up to large-scale applications. We are able to learn the largest Dynamic Topic Model to our knowledge, and learned the dynamics of 1,000 topics from 2.6 million documents in less than half an hour, and our empirical results show that our algorithm is not only orders of magnitude faster than the baselines but also achieves lower perplexity.
- Streaming variational Bayes (SVB) is successful in learning LDA models in an online manner. However previous attempts toward developing online Monte-Carlo methods for LDA have little success, often by having much worse perplexity than their batch counterparts. We present a streaming Gibbs sampling (SGS) method, an online extension of the collapsed Gibbs sampling (CGS). Our empirical study shows that SGS can reach similar perplexity as CGS, much better than SVB. Our distributed version of SGS, DSGS, is much more scalable than SVB mainly because the updates' communication complexity is small.
- Developing efficient and scalable algorithms for Latent Dirichlet Allocation (LDA) is of wide interest for many applications. Previous work has developed an O(1) Metropolis-Hastings sampling method for each token. However, the performance is far from being optimal due to random accesses to the parameter matrices and frequent cache misses. In this paper, we first carefully analyze the memory access efficiency of existing algorithms for LDA by the scope of random access, which is the size of the memory region in which random accesses fall, within a short period of time. We then develop WarpLDA, an LDA sampler which achieves both the best O(1) time complexity per token and the best O(K) scope of random access. Our empirical results in a wide range of testing conditions demonstrate that WarpLDA is consistently 5-15x faster than the state-of-the-art Metropolis-Hastings based LightLDA, and is comparable or faster than the sparsity aware F+LDA. With WarpLDA, users can learn up to one million topics from hundreds of millions of documents in a few hours, at an unprecedentedly throughput of 11G tokens per second.
- We consider the problem of computing a positive definite $p \times p$ inverse covariance matrix aka precision matrix $\theta=(\theta_{ij})$ which optimizes a regularized Gaussian maximum likelihood problem, with the elastic-net regularizer $\sum_{i,j=1}^{p} \lambda (\alpha|\theta_{ij}| + \frac{1}{2}(1- \alpha) \theta_{ij}^2),$ with regularization parameters $\alpha \in [0,1]$ and $\lambda>0$. The associated convex semidefinite optimization problem is notoriously difficult to scale to large problems and has demanded significant attention over the past several years. We propose a new algorithmic framework based on stochastic proximal optimization (on the primal problem) that can be used to obtain near optimal solutions with substantial computational savings over deterministic algorithms. A key challenge of our work stems from the fact that the optimization problem being investigated does not satisfy the usual assumptions required by stochastic gradient methods. Our proposal has (a) computational guarantees and (b) scales well to large problems, even if the solution is not too sparse; thereby, enhancing the scope of regularized maximum likelihood problems to many large-scale problems of contemporary interest. An important aspect of our proposal is to bypass the \emphdeterministic computation of a matrix inverse by drawing random samples from a suitable multivariate Gaussian distribution.
- In this article, we study a partially linear single-index model for longitudinal data under a general framework which includes both the sparse and dense longitudinal data cases. A semiparametric estimation method based on a combination of the local linear smoothing and generalized estimation equations (GEE) is introduced to estimate the two parameter vectors as well as the unknown link function. Under some mild conditions, we derive the asymptotic properties of the proposed parametric and nonparametric estimators in different scenarios, from which we find that the convergence rates and asymptotic variances of the proposed estimators for sparse longitudinal data would be substantially different from those for dense longitudinal data. We also discuss the estimation of the covariance (or weight) matrices involved in the semiparametric GEE method. Furthermore, we provide some numerical studies including Monte Carlo simulation and an empirical application to illustrate our methodology and theory.
- We propose the convex factorization machine (CFM), which is a convex variant of the widely used Factorization Machines (FMs). Specifically, we employ a linear+quadratic model and regularize the linear term with the $\ell_2$-regularizer and the quadratic term with the trace norm regularizer. Then, we formulate the CFM optimization as a semidefinite programming problem and propose an efficient optimization procedure with Hazan's algorithm. A key advantage of CFM over existing FMs is that it can find a globally optimal solution, while FMs may get a poor locally optimal solution since the objective function of FMs is non-convex. In addition, the proposed algorithm is simple yet effective and can be implemented easily. Finally, CFM is a general factorization method and can also be used for other factorization problems including including multi-view matrix factorization and tensor completion problems. Through synthetic and movielens datasets, we first show that the proposed CFM achieves results competitive to FMs. Furthermore, in a toxicogenomics prediction task, we show that CFM outperforms a state-of-the-art tensor factorization method.
- Apr 21 2015 stat.ME arXiv:1504.04935v1For testing the independence of two vectors with respective dimensions $p_1$ and $p_2$, the existing literature in high-dimensional statistics all assume that both dimensions $p_1$ and $p_2$ grow to infinity with the sample size. However, as evidenced in the RNA-sequencing data analysis discussed in the paper, it happens frequently that one of the dimension is quite small and the other quite large compared to the sample size. In this paper, we address this new asymptotic framework for the independence test. A new test procedure is introduced and its asymptotic normality is established when the vectors are normal distributed. A Mote-Carlo study demonstrates the consistency of the procedure and exhibits its superiority over some existing high-dimensional procedures. Applied to the RNA-sequencing data mentioned above, we obtain very convincing results on pairwise independence/dependence of gene isoform expressions as attested by prior knowledge established in that field. Lastly, Monte-Carlo experiments show that the procedure is robust against the normality assumption on the population vectors.
- Explosive growth in data and availability of cheap computing resources have sparked increasing interest in Big learning, an emerging subfield that studies scalable machine learning algorithms, systems, and applications with Big Data. Bayesian methods represent one important class of statistic methods for machine learning, with substantial recent developments on adaptive, flexible and scalable Bayesian learning. This article provides a survey of the recent advances in Big learning with Bayesian methods, termed Big Bayesian Learning, including nonparametric Bayesian methods for adaptively inferring model complexity, regularized Bayesian inference for improving the flexibility via posterior regularization, and scalable algorithms and systems based on stochastic subsampling and distributed computing for dealing with large-scale applications.
- The expressive power of a Gaussian process (GP) model comes at a cost of poor scalability in the data size. To improve its scalability, this paper presents a low-rank-cum-Markov approximation (LMA) of the GP model that is novel in leveraging the dual computational advantages stemming from complementing a low-rank approximate representation of the full-rank GP based on a support set of inputs with a Markov approximation of the resulting residual process; the latter approximation is guaranteed to be closest in the Kullback-Leibler distance criterion subject to some constraint and is considerably more refined than that of existing sparse GP models utilizing low-rank representations due to its more relaxed conditional independence assumption (especially with larger data). As a result, our LMA method can trade off between the size of the support set and the order of the Markov property to (a) incur lower computational cost than such sparse GP models while achieving predictive performance comparable to them and (b) accurately represent features/patterns of any scale. Interestingly, varying the Markov order produces a spectrum of LMAs with PIC approximation and full-rank GP at the two extremes. An advantage of our LMA method is that it is amenable to parallelization on multiple machines/cores, thereby gaining greater scalability. Empirical evaluation on three real-world datasets in clusters of up to 32 computing nodes shows that our centralized and parallel LMA methods are significantly more time-efficient and scalable than state-of-the-art sparse and full-rank GP regression methods while achieving comparable predictive performances.
- Gaussian processes (GP) are Bayesian non-parametric models that are widely used for probabilistic regression. Unfortunately, it cannot scale well with large data nor perform real-time predictions due to its cubic time cost in the data size. This paper presents two parallel GP regression methods that exploit low-rank covariance matrix approximations for distributing the computational load among parallel machines to achieve time efficiency and scalability. We theoretically guarantee the predictive performances of our proposed parallel GPs to be equivalent to that of some centralized approximate GP regression methods: The computation of their centralized counterparts can be distributed among parallel machines, hence achieving greater time efficiency and scalability. We analytically compare the properties of our parallel GPs such as time, space, and communication complexity. Empirical evaluation on two real-world datasets in a cluster of 20 computing nodes shows that our parallel GPs are significantly more time-efficient and scalable than their centralized counterparts and exact/full GP while achieving predictive performances comparable to full GP.
- Apr 30 2014 stat.OT arXiv:1404.7208v2Sample-average approximations (SAA) are a practical means of finding approximate solutions of stochastic programming problems involving an extremely large (or infinite) number of scenarios. SAA can also be used to find estimates of a lower bound on the optimal objective value of the true problem which, when coupled with an upper bound, provides confidence intervals for the true optimal objective value and valuable information about the quality of the approximate solutions. Specifically, the lower bound can be estimated by solving multiple SAA problems (each obtained using a particular sampling method) and averaging the obtained objective values. State-of-the-art methods for lower-bound estimation generate batches of scenarios for the SAA problems independently. In this paper, we describe sampling methods that produce negatively dependent batches, thus reducing the variance of the sample-averaged lower bound estimator and increasing its usefulness in defining a confidence interval for the optimal objective value. We provide conditions under which the new sampling methods can reduce the variance of the lower bound estimator, and present computational results to verify that our scheme can reduce the variance significantly, by comparison with the traditional Latin hypercube approach.
- Central to robot exploration and mapping is the task of persistent localization in environmental fields characterized by spatially correlated measurements. This paper presents a Gaussian process localization (GP-Localize) algorithm that, in contrast to existing works, can exploit the spatially correlated field measurements taken during a robot's exploration (instead of relying on prior training data) for efficiently and scalably learning the GP observation model online through our proposed novel online sparse GP. As a result, GP-Localize is capable of achieving constant time and memory (i.e., independent of the size of the data) per filtering step, which demonstrates the practical feasibility of using GPs for persistent robot localization and autonomy. Empirical evaluation via simulated experiments with real-world datasets and a real robot experiment shows that GP-Localize outperforms existing GP localization algorithms.
- Dec 11 2013 stat.AP arXiv:1312.2687v1We discuss the statistical properties of a recently introduced unbiased stochastic approximation to the score equations for maximum likelihood calculation for Gaussian processes. Under certain conditions, including bounded condition number of the covariance matrix, the approach achieves $O(n)$ storage and nearly $O(n)$ computational effort per optimization step, where $n$ is the number of data sites. Here, we prove that if the condition number of the covariance matrix is bounded, then the approximate score equations are nearly optimal in a well-defined sense. Therefore, not only is the approximation efficient to compute, but it also has comparable statistical properties to the exact maximum likelihood estimates. We discuss a modification of the stochastic approximation in which design elements of the stochastic terms mimic patterns from a $2^n$ factorial design. We prove these designs are always at least as good as the unstructured design, and we demonstrate through simulation that they can produce a substantial improvement over random designs. Our findings are validated by numerical experiments on simulated data sets of up to 1 million observations. We apply the approach to fit a space-time model to over 80,000 observations of total column ozone contained in the latitude band $40^{\circ}$-$50^{\circ}$N during April 2012.
- Nov 01 2013 stat.ML arXiv:1310.8612v1Incorporating spatial information into hyperspectral unmixing procedures has been shown to have positive effects, due to the inherent spatial-spectral duality in hyperspectral scenes. Current research works that consider spatial information are mainly focused on the linear mixing model. In this paper, we investigate a variational approach to incorporating spatial correlation into a nonlinear unmixing procedure. A nonlinear algorithm operating in reproducing kernel Hilbert spaces, associated with an $\ell_1$ local variation norm as the spatial regularizer, is derived. Experimental results, with both synthetic and real data, illustrate the effectiveness of the proposed scheme.
- Nov 01 2013 stat.ML arXiv:1310.8618v1The kernel least-mean-square (KLMS) algorithm is an appealing tool for online identification of nonlinear systems due to its simplicity and robustness. In addition to choosing a reproducing kernel and setting filter parameters, designing a KLMS adaptive filter requires to select a so-called dictionary in order to get a finite-order model. This dictionary has a significant impact on performance, and requires careful consideration. Theoretical analysis of KLMS as a function of dictionary setting has rarely, if ever, been addressed in the literature. In an analysis previously published by the authors, the dictionary elements were assumed to be governed by the same probability density function of the input data. In this paper, we modify this study by considering the dictionary as part of the filter parameters to be set. This theoretical analysis paves the way for future investigations on KLMS dictionary design.
- This paper presents a hypothesis testing method given independent samples from a number of connected populations. The method is motivated by a forestry project for monitoring change in the strength of lumber. Traditional practice has been built upon nonparametric methods which ignore the fact that these populations are connected. By pooling the information in multiple samples through a density ratio model, the proposed empirical likelihood method leads to a more efficient inference and therefore reduces the cost in applications. The new test has a classical chi-square null limiting distribution. Its power function is obtained under a class of local alternatives. The local power is found increased even when some underlying populations are unrelated to the hypothesis of interest. Simulation studies confirm that this test has better power properties than potential competitors, and is robust to model misspecification. An application example to lumber strength is included.
- Population quantiles and their functions are important parameters in many applications. For example, the lower quantiles often serve as crucial quality indices for forestry products. Given several independent samples from populations satisfying the density ratio model, we investigate the properties of empirical likelihood (EL) based inferences. The induced EL quantile estimators are shown to admit a Bahadur representation that leads to asymptotically valid confidence intervals for functions of quantiles. We rigorously prove that EL quantiles based on all the samples are more efficient than empirical quantiles based on individual samples. A simulation study shows that the EL quantiles and their functions have superior performance when the density ratio model assumption is satisfied and when it is mildly violated. An example is used to demonstrate the new method and the potential cost savings.
- Jun 25 2013 stat.ML arXiv:1306.5310v1Adaptive filtering algorithms operating in reproducing kernel Hilbert spaces have demonstrated superiority over their linear counterpart for nonlinear system identification. Unfortunately, an undesirable characteristic of these methods is that the order of the filters grows linearly with the number of input data. This dramatically increases the computational burden and memory requirement. A variety of strategies based on dictionary learning have been proposed to overcome this severe drawback. Few, if any, of these works analyze the problem of updating the dictionary in a time-varying environment. In this paper, we present an analytical study of the convergence behavior of the Gaussian least-mean-square algorithm in the case where the statistics of the dictionary elements only partially match the statistics of the input data. This allows us to emphasize the need for updating the dictionary in an online way, by discarding the obsolete elements and adding appropriate ones. We introduce a kernel least-mean-square algorithm with L1-norm regularization to automatically perform this task. The stability in the mean of this method is analyzed, and its performance is tested with experiments.
- Gaussian processes (GP) are Bayesian non-parametric models that are widely used for probabilistic regression. Unfortunately, it cannot scale well with large data nor perform real-time predictions due to its cubic time cost in the data size. This paper presents two parallel GP regression methods that exploit low-rank covariance matrix approximations for distributing the computational load among parallel machines to achieve time efficiency and scalability. We theoretically guarantee the predictive performances of our proposed parallel GPs to be equivalent to that of some centralized approximate GP regression methods: The computation of their centralized counterparts can be distributed among parallel machines, hence achieving greater time efficiency and scalability. We analytically compare the properties of our parallel GPs such as time, space, and communication complexity. Empirical evaluation on two real-world datasets in a cluster of 20 computing nodes shows that our parallel GPs are significantly more time-efficient and scalable than their centralized counterparts and exact/full GP while achieving predictive performances comparable to full GP.
- May 24 2013 stat.AP arXiv:1305.5355v1With the development of next generation sequencing technology, researchers have now been able to study the microbiome composition using direct sequencing, whose output are bacterial taxa counts for each microbiome sample. One goal of microbiome study is to associate the microbiome composition with environmental covariates. We propose to model the taxa counts using a Dirichlet-multinomial (DM) regression model in order to account for overdispersion of observed counts. The DM regression model can be used for testing the association between taxa composition and covariates using the likelihood ratio test. However, when the number of covariates is large, multiple testing can lead to loss of power. To address the high dimensionality of the problem, we develop a penalized likelihood approach to estimate the regression parameters and to select the variables by imposing a sparse group $\ell_1$ penalty to encourage both group-level and within-group sparsity. Such a variable selection procedure can lead to selection of the relevant covariates and their associated bacterial taxa. An efficient block-coordinate descent algorithm is developed to solve the optimization problem. We present extensive simulations to demonstrate that the sparse DM regression can result in better identification of the microbiome-associated covariates than models that ignore overdispersion or only consider the proportions. We demonstrate the power of our method in an analysis of a data set evaluating the effects of nutrient intake on human gut microbiome composition. Our results have clearly shown that the nutrient intake is strongly associated with the human gut microbiome.
- We consider stochastic strongly convex optimization with a complex inequality constraint. This complex inequality constraint may lead to computationally expensive projections in algorithmic iterations of the stochastic gradient descent~(SGD) methods. To reduce the computation costs pertaining to the projections, we propose an Epoch-Projection Stochastic Gradient Descent~(Epro-SGD) method. The proposed Epro-SGD method consists of a sequence of epochs; it applies SGD to an augmented objective function at each iteration within the epoch, and then performs a projection at the end of each epoch. Given a strongly convex optimization and for a total number of $T$ iterations, Epro-SGD requires only $\log(T)$ projections, and meanwhile attains an optimal convergence rate of $O(1/T)$, both in expectation and with a high probability. To exploit the structure of the optimization problem, we propose a proximal variant of Epro-SGD, namely Epro-ORDA, based on the optimal regularized dual averaging method. We apply the proposed methods on real-world applications; the empirical results demonstrate the effectiveness of our methods.
- Estimation of the population spectral distribution from a large dimensional sample covariance matrixFeb 05 2013 stat.ME arXiv:1302.0355v1This paper introduces a new method to estimate the spectral distribution of a population covariance matrix from high-dimensional data. The method is founded on a meaningful generalization of the seminal Marcenko-Pastur equation, originally defined in the complex plan, to the real line. Beyond its easy implementation and the established asymptotic consistency, the new estimator outperforms two existing estimators from the literature in almost all the situations tested in a simulation experiment. An application to the analysis of the correlation matrix of S&P stocks data is also given.
- We study the problem of estimating multiple predictive functions from a dictionary of basis functions in the nonparametric regression setting. Our estimation scheme assumes that each predictive function can be estimated in the form of a linear combination of the basis functions. By assuming that the coefficient matrix admits a sparse low-rank structure, we formulate the function estimation problem as a convex program regularized by the trace norm and the $\ell_1$-norm simultaneously. We propose to solve the convex program using the accelerated gradient (AG) method and the alternating direction method of multipliers (ADMM) respectively; we also develop efficient algorithms to solve the key components in both AG and ADMM. In addition, we conduct theoretical analysis on the proposed function estimation scheme: we derive a key property of the optimal solution to the convex program; based on an assumption on the basis functions, we establish a performance bound of the proposed function estimation scheme (via the composite regularization). Simulation studies demonstrate the effectiveness and efficiency of the proposed algorithms.
- In this paper, we consider a partially linear model of the form $Y_t=X_t^{\tau}\theta_0+g(V_t)+\epsilon_t$, $t=1,...,n$, where $\{V_t\}$ is a $\beta$ null recurrent Markov chain, $\{X_t\}$ is a sequence of either strictly stationary or non-stationary regressors and $\{\epsilon_t\}$ is a stationary sequence. We propose to estimate both $\theta_0$ and $g(\cdot)$ by a semi-parametric least-squares (SLS) estimation method. Under certain conditions, we then show that the proposed SLS estimator of $\theta_0$ is still asymptotically normal with the same rate as for the case of stationary time series. In addition, we also establish an asymptotic distribution for the nonparametric estimator of the function $g(\cdot)$. Some numerical examples are provided to show that our theory and estimation method work well in practice.
- We investigate the implications of free probability for random matrices. From rules for calculating all possible joint moments of two free random matrices, we develop a notion of partial freeness which is quantified by the breakdown of these rules. We provide a combinatorial interpretation for partial freeness as the presence of closed paths in Hilbert space defined by particular joint moments. We also discuss how asymptotic moment expansions provide an error term on the density of states. We present MATLAB code for the calculation of moments and free cumulants of arbitrary random matrices.
- Theoretical studies of localization, anomalous diffusion and ergodicity breaking require solving the electronic structure of disordered systems. We use free probability to approximate the ensemble- averaged density of states without exact diagonalization. We present an error analysis that quantifies the accuracy using a generalized moment expansion, allowing us to distinguish between different approximations. We identify an approximation that is accurate to the eighth moment across all noise strengths, and contrast this with the perturbation theory and isotropic entanglement theory.
- Nov 30 2010 stat.ME arXiv:1011.6298v1In this paper, we study a kernel smoothing approach for denoising a tensor field. Particularly, both simulation studies and theoretical analysis are conducted to understand the effects of the noise structure and the structure of the tensor field on the performance of different smoothers arising from using different metrics, viz., Euclidean, log-Euclidean and affine invariant metrics. We also study the Rician noise model and compare two regression estimators of diffusion tensors based on raw diffusion weighted imaging data at each voxel.
- Oct 07 2010 stat.AP arXiv:1010.1157v1We present an applied study in cancer genomics for integrating data and inferences from laboratory experiments on cancer cell lines with observational data obtained from human breast cancer studies. The biological focus is on improving understanding of transcriptional responses of tumors to changes in the pH level of the cellular microenvironment. The statistical focus is on connecting experimentally defined biomarkers of such responses to clinical outcome in observational studies of breast cancer patients. Our analysis exemplifies a general strategy for accomplishing this kind of integration across contexts. The statistical methodologies employed here draw heavily on Bayesian sparse factor models for identifying, modularizing and correlating with clinical outcome these signatures of aggregate changes in gene expression. By projecting patterns of biological response linked to specific experimental interventions into observational studies where such responses may be evidenced via variation in gene expression across samples, we are able to define biomarkers of clinically relevant physiological states and outcomes that are rooted in the biology of the original experiment. Through this approach we identify microenvironment-related prognostic factors capable of predicting long term survival in two independent breast cancer datasets. These results suggest possible directions for future laboratory studies, as well as indicate the potential for therapeutic advances though targeted disruption of specific pathway components.
- Empirical likelihood is a popular nonparametric or semi-parametric statistical method with many nice statistical properties. Yet when the sample size is small, or the dimension of the accompanying estimating function is high, the application of the empirical likelihood method can be hindered by low precision of the chi-square approximation and by nonexistence of solutions to the estimating equations. In this paper, we show that the adjusted empirical likelihood is effective at addressing both problems. With a specific level of adjustment, the adjusted empirical likelihood achieves the high-order precision of the Bartlett correction, in addition to the advantage of a guaranteed solution to the estimating equations. Simulation results indicate that the confidence regions constructed by the adjusted empirical likelihood have coverage probabilities comparable to or substantially more accurate than the original empirical likelihood enhanced by the Bartlett correction.