results for au:Chen_J in:stat

- Principal component analysis (PCA) is widely used for feature extraction and dimensionality reduction, with documented merits in diverse tasks involving high-dimensional data. Standard PCA copes with one dataset at a time, but it is challenged when it comes to analyzing multiple datasets jointly. In certain data science settings however, one is often interested in extracting the most discriminative information from one dataset of particular interest (a.k.a. target data) relative to the other(s) (a.k.a. background data). To this end, this paper puts forth a novel approach, termed discriminative (d) PCA, for such discriminative analytics of multiple datasets. Under certain conditions, dPCA is proved to be least-squares optimal in recovering the component vector unique to the target data relative to background data. To account for nonlinear data correlations, (linear) dPCA models for one or multiple background datasets are generalized through kernel-based learning. Interestingly, all dPCA variants admit an analytical solution obtainable with a single (generalized) eigenvalue decomposition. Finally, corroborating dimensionality reduction tests using both synthetic and real datasets are provided to validate the effectiveness of the proposed methods.
- In this paper we explore different regression models based on Clusterwise Linear Regression (CLR). CLR aims to find a partition of the data into $k$ clusters such that linear regressions fitted to each of the clusters minimize the overall mean squared error on the whole data. The main obstacle to using the fitted regression models for prediction on unseen test points is the absence of a reasonable way to obtain CLR cluster labels when the values of the target variable are unknown. In this paper we propose two novel approaches to solving this problem. The first approach, predictive CLR, builds a separate classification model to predict test CLR labels. The second approach, constrained CLR, utilizes a set of user-specified constraints that force certain points to go to the same clusters. Assuming the constraint values are known for the test points, they can be directly used to assign CLR labels. We evaluate these two approaches on three UCI ML datasets as well as on a large corpus of health insurance claims. We show that both of the proposed algorithms significantly improve over the known CLR-based regression methods. Moreover, predictive CLR consistently outperforms linear regression and random forest, and shows comparable performance to support vector regression on UCI ML datasets. The constrained CLR approach achieves the best performance on the health insurance dataset, while incurring only a $\approx 20$-fold increase in computational time over linear regression.
- Probabilistic topic models are popular unsupervised learning methods, including probabilistic latent semantic indexing (pLSI) and latent Dirichlet allocation (LDA). To date, their training has been implemented on general purpose computers (GPCs), which are flexible in programming but energy-consuming. Towards low-energy implementations, this paper investigates their training on an emerging hardware technology called neuromorphic multi-chip systems (NMSs). NMSs are very effective for a family of algorithms called spiking neural networks (SNNs). We present three SNNs to train topic models. The first SNN is a batch algorithm combining the conventional collapsed Gibbs sampling (CGS) algorithm and an inference SNN to train LDA. The other two SNNs are online algorithms targeting both energy- and storage-limited environments. The two online algorithms are equivalent to training LDA by using maximum a posteriori estimation and maximizing the semi-collapsed likelihood, respectively. They use novel, tailored ordinary differential equations for stochastic optimization. We simulate the new algorithms and show that they are comparable with the GPC algorithms, while being suitable for NMS implementation. We also propose an extension to train pLSI and a method to prune the network to obey the limited fan-in of some NMSs.
- In the NIPS 2017 Learning to Run challenge, participants were tasked with building a controller for a musculoskeletal model to make it run as fast as possible through an obstacle course. Top participants were invited to describe their algorithms. In this work, we present eight solutions that used deep reinforcement learning approaches, based on algorithms such as Deep Deterministic Policy Gradient, Proximal Policy Optimization, and Trust Region Policy Optimization. Many solutions use similar relaxations and heuristics, such as reward shaping, frame skipping, discretization of the action space, symmetry, and policy blending. However, each of the eight teams implemented different modifications of the known algorithms.
- Canonical correlation analysis (CCA) is a powerful technique for discovering whether or not hidden sources are commonly present in two (or more) datasets. Its well-appreciated merits include dimensionality reduction, clustering, classification, feature selection, and data fusion. The standard CCA however, does not exploit the geometry of the common sources, which may be available from the given data or can be deduced from (cross-) correlations. In this paper, this extra information provided by the common sources generating the data is encoded in a graph, and is invoked as a graph regularizer. This leads to a novel graph-regularized CCA approach, that is termed graph (g) CCA. The novel gCCA accounts for the graph-induced knowledge of common sources, while minimizing the distance between the wanted canonical variables. Tailored for diverse practical settings where the number of data is smaller than the data vector dimensions, the dual formulation of gCCA is also developed. One such setting includes kernels that are incorporated to account for nonlinear data dependencies. The resultant graph-kernel (gk) CCA is also obtained in closed form. Finally, corroborating image classification tests over several real datasets are presented to showcase the merits of the novel linear, dual, and kernel approaches relative to competing alternatives.
- Operating in a dynamic real-world environment requires a forward-thinking and adversarial-aware design for classifiers, beyond fitting the model to the training data. In such scenarios, it is necessary to make classifiers a) harder to evade, b) better able to detect changes in the data distribution over time, and c) able to retrain and recover from model degradation. While most work on the security of machine learning has concentrated on the evasion resistance problem (a), there is little work in the areas of reacting to attacks (b and c). Additionally, while streaming data research concentrates on the ability to react to changes in the data distribution, it often takes an adversarial-agnostic view of the security problem. This makes such methods vulnerable to adversarial activity aimed at evading the concept drift detection mechanism itself. In this paper, we analyze the security of machine learning from a dynamic and adversarial-aware perspective. The existing techniques of restrictive one-class classifier models, complex learning models, and randomization-based ensembles are shown to be myopic, as they approach security as a static task. These methodologies are ill-suited for a dynamic environment, as they leak excessive information to an adversary, who can subsequently launch attacks that are indistinguishable from the benign data. Based on empirical vulnerability analysis against a sophisticated adversary, a novel feature importance hiding approach for classifier design is proposed. The proposed design ensures that future attacks on classifiers can be detected and recovered from. The proposed work serves as motivation and as a blueprint for future work in the area of Dynamic-Adversarial mining, which combines lessons learned from streaming data mining, adversarial learning, and cybersecurity.
- Mar 13 2018 stat.ME arXiv:1803.03875v1 Objectives: This study provides an effective model selection method based on the empirical likelihood approach for constructing summary receiver operating characteristic (sROC) curves from meta-analyses of diagnostic studies. Methods: We considered models from combinations of family indices and specific pairs of transformations, which cover several widely used methods for bivariate summary of sensitivity and specificity. Then a final model was selected using the proposed empirical likelihood method. Simulation scenarios were conducted based on different numbers of studies and different population distributions for the disease and non-disease cases. The performance of our proposal and other model selection criteria was also compared. Results: Although parametric likelihood-based methods are often applied in practice due to their asymptotic properties, they fail to consistently choose appropriate models for summary when the number of studies is limited. For these situations, our proposed method almost always performs better. Conclusion: When the number of studies is as small as 10 or 5, we recommend choosing a summary model via the proposed empirical likelihood method.
- We introduce instancewise feature selection as a methodology for model interpretation. Our method is based on learning a function to extract a subset of features that are most informative for each given example. This feature selector is trained to maximize the mutual information between selected features and the response variable, where the conditional distribution of the response variable given the input is the model to be explained. We develop an efficient variational approximation to the mutual information, and show that the resulting method compares favorably to other model explanation methods on a variety of synthetic and real data sets using both quantitative metrics and human evaluation.
- To ensure undisrupted business, large Internet companies need to closely monitor various KPIs (e.g., Page Views, number of online users, and number of orders) of their Web applications, to accurately detect anomalies and trigger timely troubleshooting/mitigation. However, anomaly detection for these seasonal KPIs with various patterns and data quality has been a great challenge, especially without labels. In this paper, we propose Donut, an unsupervised anomaly detection algorithm based on VAE. Thanks to a few of our key techniques, Donut greatly outperforms a state-of-the-art supervised ensemble approach and a baseline VAE approach, and its best F-scores range from 0.75 to 0.9 for the studied KPIs from a top global Internet company. We come up with a novel KDE interpretation of reconstruction for Donut, making it the first VAE-based anomaly detection algorithm with a solid theoretical explanation.
- Jan 23 2018 stat.AP arXiv:1801.06596v1 On September 10, 2017, Hurricane Irma made landfall in the Florida Keys and caused significant damage. Informed by hydrodynamic storm surge and wave modeling and post-storm satellite imagery, a rapid damage survey was soon conducted for 1600+ residential buildings in Big Pine Key and Marathon. Damage categorizations and statistical analysis reveal distinct factors governing damage at these two locations. The distance from the coast is significant for the damage in Big Pine Key, as severely damaged buildings were located near narrow waterways connected to the ocean. Building type and size are critical in Marathon, highlighted by the near-complete destruction of trailer communities there. These observations raise issues of affordability and equity that need consideration in damage recovery and rebuilding for resilience.
- Deep neural networks are vulnerable to adversarial examples. Prior defenses attempted to make deep networks more robust by either improving the network architecture or adding adversarial examples into the training set, with their respective limitations. We propose a new direction. Motivated by recent research that shows that outliers in the training set have a high negative influence on the trained model, our approach makes the model more robust by detecting and removing outliers in the training set without modifying the network architecture or requiring adversarial examples. We propose two methods for detecting outliers based on canonical examples and on training errors, respectively. After removing the outliers, we train the classifier with the remaining examples to obtain a sanitized model. Our evaluation shows that the sanitized model improves classification accuracy and forces the attacks to generate adversarial examples with higher distortions. Moreover, the Kullback-Leibler divergence from the output of the original model to that of the sanitized model allows us to distinguish between normal and adversarial examples reliably.
- We compare and contrast the statistical physics and quantum physics inspired approaches for unsupervised generative modeling of classical data. The two approaches represent probabilities of observed data using energy-based models and quantum states, respectively. Classical and quantum information patterns of the target datasets therefore provide principled guidelines for structural design and learning in these two approaches. Taking the restricted Boltzmann machine (RBM) as an example, we analyze the information-theoretic bounds of the two approaches. We verify our reasoning by comparing the performance of RBMs of various architectures on the standard MNIST dataset.
- In a number of cases, the Quantile Gaussian Process (QGP) has proven effective in emulating stochastic, univariate computer model output (Plumlee and Tuo, 2014). In this paper, we use this emulation approach within a Bayesian model calibration framework to calibrate an agent-based model of an epidemic. In addition, this approach is extended to handle the multivariate nature of the model output, which gives a time series of the count of infected individuals. The basic modeling approach is adapted from Higdon et al. (2008), using a basis representation to capture the multivariate model output. The approach is motivated with an example taken from the 2015 Ebola Challenge workshop, which simulated an Ebola epidemic to evaluate methodology.
- Gaussian random fields (GRF) are a fundamental stochastic model for spatiotemporal data analysis. An essential ingredient of GRF is the covariance function that characterizes the joint Gaussian distribution of the field. Commonly used covariance functions give rise to fully dense and unstructured covariance matrices, for which required calculations are notoriously expensive to carry out for large data. In this work, we propose a construction of covariance functions that result in matrices with a hierarchical structure. Empowered by matrix algorithms that scale linearly with the matrix dimension, the hierarchical structure is proved to be efficient for a variety of random field computations, including sampling, kriging, and likelihood evaluation. Specifically, with $n$ scattered sites, sampling and likelihood evaluation have an $O(n)$ cost and kriging has an $O(\log n)$ cost after preprocessing, particularly favorable for the kriging of an extremely large number of sites (e.g., predicting on more sites than observed). We present comprehensive numerical experiments to show the use of the constructed covariance functions and their appealing computation time. Numerical examples on a laptop include simulated data of size up to one million, as well as a climate data product with over two million observations.
- Kernel PCA is a widely used nonlinear dimension reduction technique in machine learning, but storing the kernel matrix is notoriously challenging when the sample size is large. Inspired by Yi et al. [2016], where the idea of partial matrix sampling followed by nonconvex optimization is proposed for matrix completion and robust PCA, we apply a similar approach to memory-efficient Kernel PCA. In theory, with no assumptions on the kernel matrix in terms of eigenvalues or eigenvectors, we established a model-free theory for the low-rank approximation based on any local minimum of the proposed objective function. As interesting byproducts, when the underlying positive semidefinite matrix is assumed to be low-rank and highly structured, corollaries of our main theorem improve the state-of-the-art results of Ge et al. [2016, 2017] for nonconvex matrix completion with no spurious local minima. Numerical experiments also show that our approach is competitive in terms of approximation accuracy compared to the well-known Nyström algorithm for Kernel PCA.
- We propose a new localized inference algorithm for answering marginalization queries in large graphical models with the correlation decay property. Given a query variable and a large graphical model, we define a much smaller model in a local region around the query variable in the target model so that the marginal distribution of the query variable can be accurately approximated. We introduce two approximation error bounds based on Dobrushin's comparison theorem and apply our bounds to derive a greedy expansion algorithm that efficiently guides the selection of neighbor nodes for localized inference. We verify our theoretical bounds on various datasets and demonstrate that our localized inference algorithm can provide fast and accurate approximation for large graphical models.
- Graph convolutional networks (GCNs) are powerful deep neural networks for graph-structured data. However, a GCN computes the representation of a node recursively from its neighbors, making the receptive field size grow exponentially with the number of layers. Previous attempts at reducing the receptive field size by subsampling neighbors do not have a convergence guarantee, and their receptive field size per node is still on the order of hundreds. In this paper, we develop control variate based algorithms which allow sampling an arbitrarily small neighbor size. Furthermore, we prove a new theoretical guarantee that our algorithms converge to a local optimum of GCN. Empirical results show that our algorithms enjoy convergence similar to that of the exact algorithm using only two neighbors per node. The runtime of our algorithms on a large Reddit dataset is only one seventh that of previous neighbor sampling algorithms.
- Principal component analysis (PCA) has well-documented merits for data extraction and dimensionality reduction. PCA deals with a single dataset at a time, and it is challenged when it comes to analyzing multiple datasets. Yet in certain setups, one wishes to extract the most significant information of one dataset relative to other datasets. Specifically, the interest may be on identifying, namely extracting features that are specific to a single target dataset but not the others. This paper develops a novel approach for such so-termed discriminative data analysis, and establishes its optimality in the least-squares (LS) sense under suitable data modeling assumptions. The criterion reveals linear combinations of variables by maximizing the ratio of the variance of the target data to that of the remainders. The novel approach solves a generalized eigenvalue problem by performing SVD just once. Numerical tests using synthetic and real datasets showcase the merits of the proposed approach relative to its competing alternatives.
- In this paper, we design and analyze a new zeroth-order online algorithm, namely, the zeroth-order online alternating direction method of multipliers (ZOO-ADMM), which enjoys the dual advantages of gradient-free operation and the ability of ADMM to accommodate complex structured regularizers. Compared to the first-order gradient-based online algorithm, we show that ZOO-ADMM requires $\sqrt{m}$ times more iterations, leading to a convergence rate of $O(\sqrt{m}/\sqrt{T})$, where $m$ is the number of optimization variables, and $T$ is the number of iterations. To accelerate ZOO-ADMM, we propose two minibatch strategies: gradient sample averaging and observation averaging, resulting in an improved convergence rate of $O(\sqrt{1+q^{-1}m}/\sqrt{T})$, where $q$ is the minibatch size. In addition to convergence analysis, we also demonstrate applications of ZOO-ADMM in signal processing, statistics, and machine learning.
- Identifying arbitrary topologies of power networks in real time is a computationally hard problem due to the number of hypotheses that grows exponentially with the network size. A new "Learning-to-Infer" variational inference method is developed for efficient inference of every line status in the network. Optimizing the variational model is transformed to and solved as a discriminative learning problem based on Monte Carlo samples generated with power flow simulations. A major advantage of the developed Learning-to-Infer method is that the labeled data used for training can be generated in an arbitrarily large amount fast and at very little cost. As a result, the power of offline training is fully exploited to learn very complex classifiers for effective real-time topology identification. The proposed methods are evaluated in the IEEE 30, 118 and 300 bus systems. Excellent performance in identifying arbitrary power network topologies in real time is achieved even with relatively simple variational models and a reasonably small amount of data.
- We propose a top-down algorithm for multiple testing on directed acyclic graphs (DAGs), where nodes represent hypotheses and edges specify a partial ordering in which hypotheses must be tested. The procedure is guaranteed to reject a sub-DAG with bounded false discovery rate (FDR) while satisfying the logical constraint that a rejected node's parents must also be rejected. It is designed for sequential testing settings, when the DAG structure is known a priori, but the p-values are obtained selectively (such as sequential conduction of experiments), but the algorithm is also applicable in non-sequential settings when all p-values can be calculated in advance (such as variable/model selection). Our DAGGER algorithm, shorthand for Greedily Evolving Rejections on DAGs, allows for independence, positive or arbitrary dependence of the p-values, and is guaranteed to work on two different types of DAGs: (a) intersection DAGs in which all nodes are intersection hypotheses, with parents being supersets of children, or (b) general DAGs in which all nodes may be elementary hypotheses. The DAGGER procedure has the appealing property that it specializes to known algorithms in the special cases of trees and line graphs, and simplifies to the classic Benjamini-Hochberg procedure when the DAG has no edges. We explore the empirical performance of DAGGER using simulations, as well as a real dataset corresponding to a gene ontology DAG, showing that it performs favorably in terms of time and power.
- In this paper we introduce ZhuSuan, a Python probabilistic programming library for Bayesian deep learning, which conjoins the complementary advantages of Bayesian methods and deep learning. ZhuSuan is built upon TensorFlow. Unlike existing deep learning libraries, which are mainly designed for deterministic neural networks and supervised tasks, ZhuSuan is distinguished by its deep roots in Bayesian inference, thus supporting various kinds of probabilistic models, including both the traditional hierarchical Bayesian models and recent deep generative models. We use running examples to illustrate probabilistic programming in ZhuSuan, including Bayesian logistic regression, variational auto-encoders, deep sigmoid belief networks and Bayesian recurrent neural networks.
- In 2015, Barber and Candes introduced a new variable selection procedure called the knockoff filter to control the false discovery rate (FDR) and proved that this method achieves exact FDR control. Inspired by the work of Barber and Candes (2015), we propose and analyze a pseudo-knockoff filter that inherits some advantages of the original knockoff filter and has more flexibility in constructing its knockoff matrix. Although we have not been able to obtain exact FDR control of the pseudo-knockoff filter, we show that it satisfies an expectation inequality that offers some insight into FDR control. Moreover, we provide some partial analysis of the pseudo-knockoff filter for the half Lasso and the least squares statistics. Our analysis indicates that the inverse of the covariance matrix of the feature matrix plays an important role in designing and analyzing the pseudo-knockoff filter. Our preliminary numerical experiments show that the pseudo-knockoff filter with the half Lasso statistic has FDR control. Moreover, our numerical experiments show that the pseudo-knockoff filter could offer more power than the original knockoff filter with the OMP or Lasso Path statistic when the features are correlated and non-sparse.
- We present a unified framework to analyze the global convergence of Langevin dynamics based algorithms for nonconvex finite-sum optimization with $n$ component functions. At the core of our analysis is a direct analysis of the ergodicity of the numerical approximations to Langevin dynamics, which leads to faster convergence rates. Specifically, we show that gradient Langevin dynamics (GLD) and stochastic gradient Langevin dynamics (SGLD) converge to the almost minimizer within $\tilde O\big(nd/(\lambda\epsilon) \big)$ and $\tilde O\big(d^7/(\lambda^5\epsilon^5) \big)$ stochastic gradient evaluations respectively, where $d$ is the problem dimension, and $\lambda$ is the spectral gap of the Markov chain generated by GLD. Both of the results improve upon the best known gradient complexity results. Furthermore, for the first time we prove the global convergence guarantee for variance reduced stochastic gradient Langevin dynamics (VR-SGLD) to the almost minimizer after $\tilde O\big(\sqrt{n}d^5/(\lambda^4\epsilon^{5/2})\big)$ stochastic gradient evaluations, which outperforms the gradient complexities of GLD and SGLD in a wide regime. Our theoretical analyses shed some light on using Langevin dynamics based algorithms for nonconvex optimization with provable guarantees.
- We study testing high-dimensional covariance matrices under a generalized elliptical model. The model accommodates several stylized facts of real data including heteroskedasticity, heavy-tailedness, asymmetry, etc. We consider the high-dimensional setting where the dimension $p$ and the sample size $n$ grow to infinity proportionally, and establish a central limit theorem for the linear spectral statistic of the sample covariance matrix based on self-normalized observations. The central limit theorem is different from the existing ones for the linear spectral statistic of the usual sample covariance matrix. Our tests based on the new central limit theorem neither assume a specific parametric distribution nor involve the kurtosis of data. Simulation studies show that our tests work well even when the fourth moment does not exist. Empirically, we analyze the idiosyncratic returns under the Fama-French three-factor model for S&P 500 Financials sector stocks, and our tests reject the hypothesis that the idiosyncratic returns are uncorrelated.
- We propose a framework for feature selection that employs kernel-based measures of independence to find a subset of covariates that is maximally predictive of the response. Building on past work in kernel dimension reduction, we formulate our approach as a constrained optimization problem involving the trace of the conditional covariance operator, and additionally provide some consistency results. We then demonstrate on a variety of synthetic and real data sets that our method compares favorably with other state-of-the-art algorithms.
- In deep reinforcement learning (RL) tasks, an efficient exploration mechanism should be able to encourage an agent to take actions that lead to less frequent states which may yield higher accumulative future return. However, both knowing about the future and evaluating the frequentness of states are non-trivial tasks, especially for deep RL domains, where a state is represented by high-dimensional image frames. In this paper, we propose a novel informed exploration framework for deep RL, where we build the capability for an RL agent to predict over the future transitions and evaluate the frequentness for the predicted future frames in a meaningful manner. To this end, we train a deep prediction model to predict future frames given a state-action pair, and a convolutional autoencoder model to hash over the seen frames. In addition, to utilize the counts derived from the seen frames to evaluate the frequentness for the predicted frames, we tackle the challenge of matching the predicted future frames and their corresponding seen frames at the latent feature level. In this way, we derive a reliable metric for evaluating the novelty of the future direction pointed by each action, and hence inform the agent to explore the least frequent one.
- We present an efficient score statistic, called the $\textsf{S}^3 \textsf{T}$ statistic, to detect the emergence of a spatially and temporally correlated signal from either fixed-sample or sequential data. The signal may cause a mean shift and/or a change in the covariance structure. The score statistic can capture both spatial and temporal structures of the change and hence is particularly powerful in detecting weak signals. The score statistic is computationally efficient and statistically powerful. Our main theoretical contributions are accurate analytical approximations of the false alarm rate of the detection procedures, which can be used to calibrate the threshold analytically. Numerical experiments on simulated and real data demonstrate the good performance of our procedure for solar flare detection and water quality monitoring.
- In many applications, we need to study a linear regression model that consists of a response variable and a large number of potential explanatory variables and determine which variables are truly associated with the response. In 2015, Barber and Candes introduced a new variable selection procedure called the knockoff filter to control the false discovery rate (FDR) and proved that this method achieves exact FDR control. In this paper, we provide some analysis of the knockoff filter and its variants. Based on our analysis, we propose a PCA prototype group selection filter that has exact group FDR control and several advantages over existing group selection methods for strongly correlated features. Another contribution is that we propose a new noise estimator that can be incorporated into the knockoff statistic from a penalized method without violating the exchangeability property. Our analysis also reveals that some knockoff statistics, including the Lasso path and the marginal correlation statistics, suffer from the alternating sign effect. To overcome this deficiency, we introduce the notion of a good statistic and propose several alternative statistics that take advantage of the good statistic property. Finally, we present a number of numerical experiments to demonstrate the effectiveness of our methods and confirm our analysis.
- Jun 06 2017 stat.ME arXiv:1706.01426v1 Learning with Reproducing Kernel Hilbert Spaces (RKHS) has been widely used in many scientific disciplines. Because an RKHS can be very flexible, it is common to impose a regularization term in the optimization to prevent overfitting. Standard RKHS learning employs the squared norm penalty of the learning function. Despite its success, many challenges remain. In particular, one cannot directly use the squared norm penalty for variable selection or data extraction. Therefore, when there exist noise predictors, or the underlying function has a sparse representation in the dual space, the performance of standard RKHS learning can be suboptimal. In the literature, work has been proposed on how to perform variable selection in RKHS learning, and a data sparsity constraint was considered for data extraction. However, how to learn in an RKHS with both variable selection and data extraction simultaneously remains unclear. In this paper, we propose a unified RKHS learning method, namely, DOuble Sparsity Kernel (DOSK) learning, to overcome this challenge. An efficient algorithm is provided to solve the corresponding optimization problem. We prove that under certain conditions, our new method can asymptotically achieve variable selection consistency. Simulated and real data results demonstrate that DOSK is highly competitive among existing approaches for RKHS learning.
- This paper deals with finding an $n$-dimensional solution $x$ to a system of quadratic equations of the form $y_i=|\langle{a}_i,x\rangle|^2$ for $1\le i \le m$, which is also known as phase retrieval and is NP-hard in general. We put forth a novel procedure for minimizing the amplitude-based least-squares empirical loss, that starts with a weighted maximal correlation initialization obtainable with a few power or Lanczos iterations, followed by successive refinements based upon a sequence of iteratively reweighted (generalized) gradient iterations. The two (both the initialization and gradient flow) stages distinguish themselves from prior contributions by the inclusion of a fresh (re)weighting regularization technique. The overall algorithm is conceptually simple, numerically scalable, and easy-to-implement. For certain random measurement models, the novel procedure is shown capable of finding the true solution $x$ in time proportional to reading the data $\{(a_i;y_i)\}_{1\le i \le m}$. This holds with high probability and without extra assumption on the signal $x$ to be recovered, provided that the number $m$ of equations is some constant $c>0$ times the number $n$ of unknowns in the signal vector, namely, $m>cn$. Empirically, the upshots of this contribution are: i) (almost) $100\%$ perfect signal recovery in the high-dimensional (say e.g., $n\ge 2,000$) regime given only an information-theoretic limit number of noiseless equations, namely, $m=2n-1$ in the real-valued Gaussian case; and, ii) (nearly) optimal statistical accuracy in the presence of additive noise of bounded support. Finally, substantial numerical tests using both synthetic data and real images corroborate markedly improved signal recovery performance and computational efficiency of our novel procedure relative to state-of-the-art approaches.
- May 30 2017 stat.ME arXiv:1705.10063v1 Sample surveys are widely used to obtain information about totals, means, medians, and other parameters of finite populations. In many applications, similar information is desired for subpopulations such as individuals in specific geographic areas and socio-demographic groups. When the surveys are conducted at national or similarly high levels, a probability sampling can result in just a few sampling units from many unplanned subpopulations at the design stage. Cost considerations may also lead to low sample sizes from individual small areas. Estimating the parameters of these subpopulations with satisfactory precision and evaluating their accuracy are serious challenges for statisticians. To overcome the difficulties, statisticians resort to pooling information across the small areas via suitable model assumptions, administrative archives, and census data. In this paper, we develop an array of small area quantile estimators. The novelty is the introduction of a semiparametric density ratio model for the error distribution in the unit-level nested error regression model. In contrast, the existing methods are usually most effective when the response values are jointly normal. We also propose a resampling procedure for estimating the mean square errors of these estimators. Simulation results indicate that the new methods have superior performance when the population distributions are skewed and remain competitive otherwise.
- For various applications, the relations between the dependent and independent variables are highly nonlinear. Consequently, for large scale complex problems, neural networks and regression trees are commonly preferred over linear models such as Lasso. This work proposes learning the feature nonlinearities by binning feature values and finding the best fit in each quantile using non-convex regularized linear regression. The algorithm first captures the dependence between neighboring quantiles by enforcing smoothness via piecewise-constant/linear approximation and then selects a sparse subset of good features. We prove that the proposed algorithm is statistically and computationally efficient. In particular, it achieves a linear rate of convergence while requiring a near-minimal number of samples. Evaluations on synthetic and real datasets demonstrate that the algorithm is competitive with the current state of the art and accurately learns feature nonlinearities. Finally, we explore an interesting connection between the binning stage of our algorithm and sparse Johnson-Lindenstrauss matrices.
- In this chapter, we analyze nonlinear filtering problems in distributed environments, e.g., sensor networks or peer-to-peer protocols. In these scenarios, the agents in the environment receive measurements in a streaming fashion, and they are required to estimate a common (nonlinear) model by alternating local computations and communications with their neighbors. We focus on the important distinction between single-task problems, where the underlying model is common to all agents, and multitask problems, where each agent might converge to a different model due to, e.g., spatial dependencies or other factors. Currently, most of the literature on distributed learning in the nonlinear case has focused on the single-task case, which may be a strong limitation in real-world scenarios. After introducing the problem and reviewing the existing approaches, we describe a simple kernel-based algorithm tailored for the multitask case. We evaluate the proposal on a simulated benchmark task, and we conclude by detailing currently open problems and lines of research.
- We consider the robust phase retrieval problem of recovering the unknown signal from the magnitude-only measurements, where the measurements can be contaminated by both sparse arbitrary corruption and bounded random noise. We propose a new nonconvex algorithm for robust phase retrieval, namely Robust Wirtinger Flow to jointly estimate the unknown signal and the sparse corruption. We show that our proposed algorithm is guaranteed to converge linearly to the unknown true signal up to a minimax optimal statistical precision in such a challenging setting. Compared with existing robust phase retrieval methods, we achieve an optimal sample complexity of $O(n)$ in both noisy and noise-free settings. Thorough experiments on both synthetic and real datasets corroborate our theory.
- Multi-parameter one-sided hypothesis test problems arise naturally in many applications. We are particularly interested in effective tests for monitoring multiple quality indices in forestry products. Our search reveals that there are many effective statistical methods in the literature for normal data, and that they can easily be adapted for non-normal data. We find that the beautiful likelihood ratio test is unsatisfactory, because in order to control the size, it must cope with the least favorable distributions at the cost of power. In this paper, we find a novel way to slightly ease the size control, obtaining a much more powerful test. Simulation confirms that the new test retains good control of the type I error and is markedly more powerful than the likelihood ratio test as well as many competitors based on normal data. The new method performs well in the context of monitoring multiple quality indices.
- Policy evaluation is a crucial step in many reinforcement-learning procedures, which estimates a value function that predicts states' long-term value under a given policy. In this paper, we focus on policy evaluation with linear function approximation over a fixed dataset. We first transform the empirical policy evaluation problem into a (quadratic) convex-concave saddle point problem, and then present a primal-dual batch gradient method, as well as two stochastic variance reduction methods for solving the problem. These algorithms scale linearly in both sample size and feature dimension. Moreover, they achieve linear convergence even when the saddle-point problem has only strong concavity in the dual variables but no strong convexity in the primal variables. Numerical experiments on benchmark problems demonstrate the effectiveness of our methods.
- Nested Chinese Restaurant Process (nCRP) topic models are powerful nonparametric Bayesian methods to extract a topic hierarchy from a given text corpus, where the hierarchical structure is automatically determined by the data. Hierarchical Latent Dirichlet Allocation (hLDA) is a popular instance of nCRP topic models. However, hLDA has only been evaluated at small scale, because the existing collapsed Gibbs sampling and instantiated weight variational inference algorithms either are not scalable or sacrifice inference quality with mean-field assumptions. Moreover, an efficient distributed implementation of the data structures, such as dynamically growing count matrices and trees, is challenging. In this paper, we propose a novel partially collapsed Gibbs sampling (PCGS) algorithm, which combines the advantages of collapsed and instantiated weight algorithms to achieve good scalability as well as high model quality. An initialization strategy is presented to further improve the model quality. Finally, we propose an efficient distributed implementation of PCGS through vectorization, pre-processing, and a careful design of the concurrent data structures and communication strategy. Empirical studies show that our algorithm is 111 times more efficient than the previous open-source implementation for hLDA, with comparable or even better model quality. Our distributed implementation can extract 1,722 topics from a 131-million-document corpus with 28 billion tokens, which is 4-5 orders of magnitude larger than the previous largest corpus, with 50 machines in 7 hours.
- We test three common information criteria (IC) for selecting the order of a Hawkes process with an intensity kernel that can be expressed as a mixture of exponential terms. These processes find application in high-frequency financial data modelling. The information criteria are Akaike's information criterion (AIC), the Bayesian information criterion (BIC) and the Hannan-Quinn criterion (HQ). Since we work with simulated data, we are able to measure the performance of model selection by the success rate of the IC in selecting the model that was used to generate the data. In particular, we are interested in the relation between correct model selection and underlying sample size. The analysis includes realistic sample sizes and parameter sets from recent literature where parameters were estimated using empirical financial intra-day data. We compare our results to theoretical predictions and similar empirical findings on the asymptotic distribution of model selection for consistent and inconsistent IC.
- Feb 17 2017 stat.ME arXiv:1702.04755v1 Precision medicine is an emerging scientific topic for disease treatment and prevention that takes into account individual patient characteristics. It is an important direction for clinical research, and many statistical methods have been recently proposed. One of the primary goals of precision medicine is to obtain an optimal individual treatment rule (ITR), which can help make decisions on treatment selection according to each patient's specific characteristics. Recently, outcome weighted learning (OWL) has been proposed to estimate such an optimal ITR in a binary treatment setting by maximizing the expected clinical outcome. However, for ordinal treatment settings, such as individualized dose finding, it is unclear how to use OWL. In this paper, we propose a new technique for estimating ITR with ordinal treatments. In particular, we propose a data duplication technique with a piecewise convex loss function. We establish Fisher consistency for the resulting estimated ITR under certain conditions, and obtain the convergence and risk bound properties. Simulated examples and two applications to datasets from an irritable bowel problem and a type 2 diabetes mellitus observational study demonstrate the highly competitive performance of the proposed method compared to existing alternatives.
- Online learning with streaming data in a distributed and collaborative manner can be useful in a wide range of applications. This topic has been receiving considerable attention in recent years with emphasis on both single-task and multitask scenarios. In single-task adaptation, agents cooperate to track an objective of common interest, while in multitask adaptation agents track multiple objectives simultaneously. Regularization is one useful technique to promote and exploit similarity among tasks in the latter scenario. This work examines an alternative way to model relations among tasks by assuming that they all share a common latent feature representation. As a result, a new multitask learning formulation is presented and algorithms are developed for its solution in a distributed online manner. We present a unified framework to analyze the mean-square-error performance of the adaptive strategies, and conduct simulations to illustrate the theoretical findings and potential applications.
- The restricted Boltzmann machine (RBM) is one of the fundamental building blocks of deep learning. RBM finds wide applications in dimensional reduction, feature extraction, and recommender systems via modeling the probability distributions of a variety of input data including natural images, speech signals, and customer ratings, etc. We build a bridge between RBM and tensor network states (TNS) widely used in quantum many-body physics research. We devise efficient algorithms to translate an RBM into the commonly used TNS. Conversely, we give sufficient and necessary conditions to determine whether a TNS can be transformed into an RBM of given architectures. Revealing these general and constructive connections can cross-fertilize both deep learning and quantum many-body physics. Notably, by exploiting the entanglement entropy bound of TNS, we can rigorously quantify the expressive power of RBM on complex data sets. Insights into TNS and its entanglement capacity can guide the design of more powerful deep learning architectures. On the other hand, RBM can represent quantum many-body states with fewer parameters compared to TNS, which may allow more efficient classical simulations.
- Jan 05 2017 stat.AP arXiv:1701.00900v1 Cooperative geolocation has attracted significant research interest in recent years. A large number of localization algorithms rely on the availability of statistical knowledge of measurement errors, which is often difficult to obtain in practice. Compared with the statistical knowledge of measurement errors, it can often be easier to obtain the measurement error bound. This work investigates a localization problem assuming unknown measurement error distribution except for a bound on the error. We first formulate this localization problem as an optimization problem to minimize the worst-case estimation error, which is shown to be a non-convex optimization problem. Then, relaxation is applied to transform it into a convex one. Furthermore, we propose a distributed algorithm to solve the problem, which will converge in a few iterations. Simulation results show that the proposed algorithms are more robust to large measurement errors than existing algorithms in the literature. Geometrical analysis providing additional insights is also provided.
- Dec 22 2016 stat.ME arXiv:1612.07072v1 This article addresses the problem of efficient Bayesian inference in dynamic systems using particle methods and makes a number of contributions. First, we develop a correlated pseudo-marginal (CPM) approach for Bayesian inference in state space (SS) models that is based on filtering the disturbances, rather than the states. This approach is useful when the state transition density is intractable or inefficient to compute, and also when the dimension of the disturbance is lower than the dimension of the state. Second, we propose a block pseudo-marginal (BPM) method that uses as the estimate of the likelihood the average of G independent unbiased estimates of the likelihood. We associate a set of underlying uniform or standard normal random numbers used to construct each of the individual unbiased likelihood estimates and then use component-wise Markov Chain Monte Carlo to update the parameter vector jointly with one set of these random numbers at a time. This induces a correlation of approximately 1-1/G between the logs of the estimated likelihood at the proposed and current values of the model parameters. Third, we show for some non-stationary state space models that the BPM approach is much more efficient than the CPM approach, because it is difficult to translate the high correlation in the underlying random numbers to high correlation between the logs of the likelihood estimates. Although our focus has been on applying the BPM method to state space models, our results and approach can be used in a wide range of applications of the PM method, such as panel data models, subsampling problems and approximate Bayesian computation.
- A novel approach termed \emph{stochastic truncated amplitude flow} (STAF) is developed to reconstruct an unknown $n$-dimensional real-/complex-valued signal $\bm{x}$ from $m$ `phaseless' quadratic equations of the form $\psi_i=|\langle\bm{a}_i,\bm{x}\rangle|$. This problem, also known as phase retrieval from magnitude-only information, is \emph{NP-hard} in general. Adopting an amplitude-based nonconvex formulation, STAF leads to an iterative solver comprising two stages: s1) Orthogonality-promoting initialization through a stochastic variance reduced gradient algorithm; and, s2) A series of iterative refinements of the initialization using stochastic truncated gradient iterations. Both stages involve a single equation per iteration, thus rendering STAF a simple, scalable, and fast approach amenable to large-scale implementations that is useful when $n$ is large. When $\{\bm{a}_i\}_{i=1}^m$ are independent Gaussian, STAF provably recovers exactly any $\bm{x}\in\mathbb{R}^n$ exponentially fast based on order of $n$ quadratic equations. STAF is also robust in the presence of additive noise of bounded support. Simulated tests involving real Gaussian $\{\bm{a}_i\}$ vectors demonstrate that STAF empirically reconstructs any $\bm{x}\in\mathbb{R}^n$ exactly from about $2.3n$ magnitude-only measurements, outperforming state-of-the-art approaches and narrowing the gap from the information-theoretic number of equations $m=2n-1$. Extensive experiments using synthetic data and real images corroborate markedly improved performance of STAF over existing alternatives.
- Positive-definite kernel functions are fundamental elements of kernel methods and Gaussian processes. A well-known construction of such functions comes from Bochner's characterization, which connects a positive-definite function with a probability distribution. Another construction, which appears to have attracted less attention, is Polya's criterion that characterizes a subset of these functions. In this paper, we study the latter characterization and derive a number of novel kernels little known previously. In the context of large-scale kernel machines, Rahimi and Recht (2007) proposed a random feature map (random Fourier) that approximates a kernel function, through independent sampling of the probability distribution in Bochner's characterization. The authors also suggested another feature map (random binning), which, although not explicitly stated, comes from Polya's characterization. We show that with the same number of random samples, the random binning map results in a Euclidean inner product closer to the kernel than does the random Fourier map. The superiority of the random binning map is confirmed empirically through regressions and classifications in the reproducing kernel Hilbert space.
- Oct 20 2016 stat.ME arXiv:1610.05809v1 Factors such as climate change, forest fires, and insect plagues lead to concerns about the mechanical strength of plantation materials. To address such concerns, these products must be closely monitored. This leads to the need to update the lumber quality monitoring procedures in American Society for Testing and Materials (ASTM) Standard D1990 (adopted in 1991) from time to time. A key component of monitoring is an effective method for detecting the change in lower percentiles of the solid lumber strength based on multiple samples. In a recent study by Verrill et al.\ (2015), eight statistical tests proposed by wood scientists were examined thoroughly based on real and simulated data sets. These tests are found unsatisfactory in differing aspects, such as a seriously inflated false alarm rate when observations are clustered, suboptimal power properties, or inconvenient ad hoc rejection regions. A contributing factor behind suboptimal performance is that most of these tests are not developed to detect the change in quantiles. In this paper, we use a nonparametric random effects model to handle the within-cluster correlations, composite empirical likelihood to avoid explicit modelling of the correlation structure, and a density ratio model to combine the information from multiple samples. In addition, we propose a cluster-based bootstrapping procedure to construct the monitoring test on quantiles which satisfactorily controls the type I error in the presence of within-cluster correlation. The performance of the test is examined through simulation experiments and a real-world example. The new method is generally applicable, not confined to the motivating example.
- Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images. Applications require LDA to handle both large datasets and a large number of topics. Though distributed CPU systems have been used, GPU-based systems have emerged as a promising alternative because of the high computational power and memory bandwidth of GPUs. However, existing GPU-based LDA systems cannot support a large number of topics because they use algorithms on dense data structures whose time and space complexity is linear in the number of topics. In this paper, we propose SaberLDA, a GPU-based LDA system that implements a sparsity-aware algorithm to achieve sublinear time complexity and scales well to learn a large number of topics. To address the challenges introduced by sparsity, we propose a novel data layout, a new warp-based sampling kernel, and an efficient sparse count matrix updating algorithm that improves locality, makes efficient use of GPU warps, and reduces memory consumption. Experiments show that SaberLDA can learn from billions-token-scale data with up to 10,000 topics, which is almost two orders of magnitude larger than that of the previous GPU-based systems. With a single GPU card, SaberLDA is able to learn 10,000 topics from a dataset of billions of tokens in a few hours, which was previously achievable only with clusters of tens of machines.
- There is a growing interest in joint multi-subject fMRI analysis. The challenge of such analysis comes from inherent anatomical and functional variability across subjects. One approach to resolving this is a shared response factor model. This assumes a shared and time synchronized stimulus across subjects. Such a model can often identify shared information, but it may not be able to pinpoint with high resolution the spatial location of this information. In this work, we examine a searchlight based shared response model to identify shared information in small contiguous regions (searchlights) across the whole brain. Validation using classification tasks demonstrates that we can pinpoint informative local regions.
- Finding the most effective way to aggregate multi-subject fMRI data is a long-standing and challenging problem. It is of increasing interest in contemporary fMRI studies of human cognition due to the scarcity of data per subject and the variability of brain anatomy and functional response across subjects. Recent work on latent factor models shows promising results in this task but this approach does not preserve spatial locality in the brain. We examine two ways to combine the ideas of a factor model and a searchlight based analysis to aggregate multi-subject fMRI data while preserving spatial locality. We first do this directly by combining a recent factor method known as a shared response model with searchlight analysis. Then we design a multi-view convolutional autoencoder for the same task. Both approaches preserve spatial locality and have competitive or better performance compared with standard searchlight analysis and the shared response model applied across the whole brain. We also report a system design to handle the computational challenge of training the convolutional autoencoder.
- We propose a novel class of kernels to alleviate the high computational cost of large-scale nonparametric learning with kernel methods. The proposed kernel is defined based on a hierarchical partitioning of the underlying data domain, where the Nyström method (a globally low-rank approximation) is married with a locally lossless approximation in a hierarchical fashion. The kernel maintains (strict) positive-definiteness. The corresponding kernel matrix admits a recursively off-diagonal low-rank structure, which allows for fast linear algebra computations. Suppressing the factor of data dimension, the memory and arithmetic complexities for training a regression or a classifier are reduced from $O(n^2)$ and $O(n^3)$ to $O(nr)$ and $O(nr^2)$, respectively, where $n$ is the number of training examples and $r$ is the rank on each level of the hierarchy. Although other randomized approximate kernels entail a similar complexity, empirical results show that the proposed kernel achieves a matching performance with a smaller $r$. We demonstrate comprehensive experiments to show the effective use of the proposed kernel on data sizes up to the order of millions.
- The large-sample properties of likelihood-based statistical inference under mixture models have received much attention from statisticians. Although the consistency of the nonparametric MLE is regarded as a standard conclusion, many researchers ignore the precise conditions required on the mixture model. An incorrect claim of consistency can lead to false conclusions even if the mixture model under investigation seems well behaved. Under a finite normal mixture model, for instance, the consistency of the plain MLE is often erroneously assumed in spite of recent research breakthroughs. This paper streamlines the consistency results for the nonparametric MLE in general, and in particular for the penalized MLE under finite normal mixture models.
- Jun 03 2016 stat.ML arXiv:1606.00832v1 We propose a nonconvex estimator for joint multivariate regression and precision matrix estimation in the high dimensional regime, under sparsity constraints. A gradient descent algorithm with hard thresholding is developed to solve the nonconvex estimator, and it attains a linear rate of convergence to the true regression coefficients and precision matrix simultaneously, up to the statistical error. Compared with existing methods along this line of research, which have little theoretical guarantee, the proposed algorithm not only is computationally much more efficient with provable convergence guarantee, but also attains the optimal finite sample statistical rate up to a logarithmic factor. Thorough experiments on both synthetic and real datasets back up our theory.
- Feb 22 2016 stat.ML arXiv:1602.06049v1 Dynamic topic models (DTMs) are very effective in discovering topics and capturing their evolution trends in time series data. To do posterior inference of DTMs, existing methods are all batch algorithms that scan the full dataset before each update of the model and make inexact variational approximations with mean-field assumptions. Due to a lack of a more scalable inference algorithm, despite the usefulness, DTMs have not captured large topic dynamics. This paper fills this research void, and presents a fast and parallelizable inference algorithm using Gibbs Sampling with Stochastic Gradient Langevin Dynamics that does not make any unwarranted assumptions. We also present a Metropolis-Hastings based $O(1)$ sampler for topic assignments for each word token. In a distributed environment, our algorithm requires very little communication between workers during sampling (almost embarrassingly parallel) and scales up to large-scale applications. We are able to learn the largest Dynamic Topic Model to our knowledge, and learned the dynamics of 1,000 topics from 2.6 million documents in less than half an hour, and our empirical results show that our algorithm is not only orders of magnitude faster than the baselines but also achieves lower perplexity.
- Streaming variational Bayes (SVB) is successful in learning LDA models in an online manner. However, previous attempts at developing online Monte-Carlo methods for LDA have had little success, often yielding much worse perplexity than their batch counterparts. We present a streaming Gibbs sampling (SGS) method, an online extension of collapsed Gibbs sampling (CGS). Our empirical study shows that SGS can reach perplexity similar to that of CGS, and much better than that of SVB. Our distributed version of SGS, DSGS, is much more scalable than SVB, mainly because the updates' communication complexity is small.
- Developing efficient and scalable algorithms for Latent Dirichlet Allocation (LDA) is of wide interest for many applications. Previous work has developed an O(1) Metropolis-Hastings sampling method for each token. However, the performance is far from being optimal due to random accesses to the parameter matrices and frequent cache misses. In this paper, we first carefully analyze the memory access efficiency of existing algorithms for LDA by the scope of random access, which is the size of the memory region in which random accesses fall, within a short period of time. We then develop WarpLDA, an LDA sampler which achieves both the best O(1) time complexity per token and the best O(K) scope of random access. Our empirical results in a wide range of testing conditions demonstrate that WarpLDA is consistently 5-15x faster than the state-of-the-art Metropolis-Hastings based LightLDA, and is comparable to or faster than the sparsity-aware F+LDA. With WarpLDA, users can learn up to one million topics from hundreds of millions of documents in a few hours, at an unprecedented throughput of 11G tokens per second.
- We consider the problem of computing a positive definite $p \times p$ inverse covariance matrix aka precision matrix $\theta=(\theta_{ij})$ which optimizes a regularized Gaussian maximum likelihood problem, with the elastic-net regularizer $\sum_{i,j=1}^{p} \lambda (\alpha|\theta_{ij}| + \frac{1}{2}(1- \alpha) \theta_{ij}^2),$ with regularization parameters $\alpha \in [0,1]$ and $\lambda>0$. The associated convex semidefinite optimization problem is notoriously difficult to scale to large problems and has demanded significant attention over the past several years. We propose a new algorithmic framework based on stochastic proximal optimization (on the primal problem) that can be used to obtain near optimal solutions with substantial computational savings over deterministic algorithms. A key challenge of our work stems from the fact that the optimization problem being investigated does not satisfy the usual assumptions required by stochastic gradient methods. Our proposal has (a) computational guarantees and (b) scales well to large problems, even if the solution is not too sparse; thereby, enhancing the scope of regularized maximum likelihood problems to many large-scale problems of contemporary interest. An important aspect of our proposal is to bypass the \emph{deterministic} computation of a matrix inverse by drawing random samples from a suitable multivariate Gaussian distribution.
- In this article, we study a partially linear single-index model for longitudinal data under a general framework which includes both the sparse and dense longitudinal data cases. A semiparametric estimation method based on a combination of the local linear smoothing and generalized estimation equations (GEE) is introduced to estimate the two parameter vectors as well as the unknown link function. Under some mild conditions, we derive the asymptotic properties of the proposed parametric and nonparametric estimators in different scenarios, from which we find that the convergence rates and asymptotic variances of the proposed estimators for sparse longitudinal data would be substantially different from those for dense longitudinal data. We also discuss the estimation of the covariance (or weight) matrices involved in the semiparametric GEE method. Furthermore, we provide some numerical studies including Monte Carlo simulation and an empirical application to illustrate our methodology and theory.
- We propose the convex factorization machine (CFM), which is a convex variant of the widely used Factorization Machines (FMs). Specifically, we employ a linear+quadratic model and regularize the linear term with the $\ell_2$-regularizer and the quadratic term with the trace norm regularizer. Then, we formulate the CFM optimization as a semidefinite programming problem and propose an efficient optimization procedure with Hazan's algorithm. A key advantage of CFM over existing FMs is that it can find a globally optimal solution, while FMs may get a poor locally optimal solution since the objective function of FMs is non-convex. In addition, the proposed algorithm is simple yet effective and can be implemented easily. Finally, CFM is a general factorization method and can also be used for other factorization problems, including multi-view matrix factorization and tensor completion problems. Through synthetic and MovieLens datasets, we first show that the proposed CFM achieves results competitive with FMs. Furthermore, in a toxicogenomics prediction task, we show that CFM outperforms a state-of-the-art tensor factorization method.
- Apr 21 2015 stat.ME arXiv:1504.04935v1. For testing the independence of two vectors with respective dimensions $p_1$ and $p_2$, the existing literature in high-dimensional statistics uniformly assumes that both dimensions $p_1$ and $p_2$ grow to infinity with the sample size. However, as evidenced in the RNA-sequencing data analysis discussed in the paper, it happens frequently that one of the dimensions is quite small and the other quite large compared to the sample size. In this paper, we address this new asymptotic framework for the independence test. A new test procedure is introduced and its asymptotic normality is established when the vectors are normally distributed. A Monte Carlo study demonstrates the consistency of the procedure and exhibits its superiority over some existing high-dimensional procedures. Applied to the RNA-sequencing data mentioned above, we obtain very convincing results on pairwise independence/dependence of gene isoform expressions as attested by prior knowledge established in that field. Lastly, Monte Carlo experiments show that the procedure is robust to departures from the normality assumption on the population vectors.
- Explosive growth in data and availability of cheap computing resources have sparked increasing interest in Big learning, an emerging subfield that studies scalable machine learning algorithms, systems, and applications with Big Data. Bayesian methods represent one important class of statistical methods for machine learning, with substantial recent developments on adaptive, flexible and scalable Bayesian learning. This article provides a survey of the recent advances in Big learning with Bayesian methods, termed Big Bayesian Learning, including nonparametric Bayesian methods for adaptively inferring model complexity, regularized Bayesian inference for improving the flexibility via posterior regularization, and scalable algorithms and systems based on stochastic subsampling and distributed computing for dealing with large-scale applications.
- The expressive power of a Gaussian process (GP) model comes at a cost of poor scalability in the data size. To improve its scalability, this paper presents a low-rank-cum-Markov approximation (LMA) of the GP model that is novel in leveraging the dual computational advantages stemming from complementing a low-rank approximate representation of the full-rank GP based on a support set of inputs with a Markov approximation of the resulting residual process; the latter approximation is guaranteed to be closest in the Kullback-Leibler distance criterion subject to some constraint and is considerably more refined than that of existing sparse GP models utilizing low-rank representations due to its more relaxed conditional independence assumption (especially with larger data). As a result, our LMA method can trade off between the size of the support set and the order of the Markov property to (a) incur lower computational cost than such sparse GP models while achieving predictive performance comparable to them and (b) accurately represent features/patterns of any scale. Interestingly, varying the Markov order produces a spectrum of LMAs with PIC approximation and full-rank GP at the two extremes. An advantage of our LMA method is that it is amenable to parallelization on multiple machines/cores, thereby gaining greater scalability. Empirical evaluation on three real-world datasets in clusters of up to 32 computing nodes shows that our centralized and parallel LMA methods are significantly more time-efficient and scalable than state-of-the-art sparse and full-rank GP regression methods while achieving comparable predictive performances.
- Gaussian processes (GPs) are Bayesian non-parametric models that are widely used for probabilistic regression. Unfortunately, they can neither scale well to large data nor perform real-time predictions, due to their cubic time cost in the data size. This paper presents two parallel GP regression methods that exploit low-rank covariance matrix approximations for distributing the computational load among parallel machines to achieve time efficiency and scalability. We theoretically guarantee the predictive performances of our proposed parallel GPs to be equivalent to those of some centralized approximate GP regression methods: the computation of their centralized counterparts can be distributed among parallel machines, hence achieving greater time efficiency and scalability. We analytically compare the properties of our parallel GPs such as time, space, and communication complexity. Empirical evaluation on two real-world datasets in a cluster of 20 computing nodes shows that our parallel GPs are significantly more time-efficient and scalable than their centralized counterparts and exact/full GP while achieving predictive performances comparable to full GP.
- Apr 30 2014 stat.OT arXiv:1404.7208v2. Sample-average approximations (SAA) are a practical means of finding approximate solutions of stochastic programming problems involving an extremely large (or infinite) number of scenarios. SAA can also be used to find estimates of a lower bound on the optimal objective value of the true problem which, when coupled with an upper bound, provides confidence intervals for the true optimal objective value and valuable information about the quality of the approximate solutions. Specifically, the lower bound can be estimated by solving multiple SAA problems (each obtained using a particular sampling method) and averaging the obtained objective values. State-of-the-art methods for lower-bound estimation generate batches of scenarios for the SAA problems independently. In this paper, we describe sampling methods that produce negatively dependent batches, thus reducing the variance of the sample-averaged lower bound estimator and increasing its usefulness in defining a confidence interval for the optimal objective value. We provide conditions under which the new sampling methods can reduce the variance of the lower bound estimator, and present computational results to verify that our scheme can reduce the variance significantly, by comparison with the traditional Latin hypercube approach.
- Central to robot exploration and mapping is the task of persistent localization in environmental fields characterized by spatially correlated measurements. This paper presents a Gaussian process localization (GP-Localize) algorithm that, in contrast to existing works, can exploit the spatially correlated field measurements taken during a robot's exploration (instead of relying on prior training data) for efficiently and scalably learning the GP observation model online through our proposed novel online sparse GP. As a result, GP-Localize is capable of achieving constant time and memory (i.e., independent of the size of the data) per filtering step, which demonstrates the practical feasibility of using GPs for persistent robot localization and autonomy. Empirical evaluation via simulated experiments with real-world datasets and a real robot experiment shows that GP-Localize outperforms existing GP localization algorithms.
- Dec 11 2013 stat.AP arXiv:1312.2687v1. We discuss the statistical properties of a recently introduced unbiased stochastic approximation to the score equations for maximum likelihood calculation for Gaussian processes. Under certain conditions, including bounded condition number of the covariance matrix, the approach achieves $O(n)$ storage and nearly $O(n)$ computational effort per optimization step, where $n$ is the number of data sites. Here, we prove that if the condition number of the covariance matrix is bounded, then the approximate score equations are nearly optimal in a well-defined sense. Therefore, not only is the approximation efficient to compute, but it also has comparable statistical properties to the exact maximum likelihood estimates. We discuss a modification of the stochastic approximation in which design elements of the stochastic terms mimic patterns from a $2^n$ factorial design. We prove these designs are always at least as good as the unstructured design, and we demonstrate through simulation that they can produce a substantial improvement over random designs. Our findings are validated by numerical experiments on simulated data sets of up to 1 million observations. We apply the approach to fit a space-time model to over 80,000 observations of total column ozone contained in the latitude band $40^{\circ}$-$50^{\circ}$N during April 2012.
- Nov 01 2013 stat.ML arXiv:1310.8612v1. Incorporating spatial information into hyperspectral unmixing procedures has been shown to have positive effects, due to the inherent spatial-spectral duality in hyperspectral scenes. Current research works that consider spatial information are mainly focused on the linear mixing model. In this paper, we investigate a variational approach to incorporating spatial correlation into a nonlinear unmixing procedure. A nonlinear algorithm operating in reproducing kernel Hilbert spaces, associated with an $\ell_1$ local variation norm as the spatial regularizer, is derived. Experimental results, with both synthetic and real data, illustrate the effectiveness of the proposed scheme.
- Nov 01 2013 stat.ML arXiv:1310.8618v1. The kernel least-mean-square (KLMS) algorithm is an appealing tool for online identification of nonlinear systems due to its simplicity and robustness. In addition to choosing a reproducing kernel and setting filter parameters, designing a KLMS adaptive filter requires selecting a so-called dictionary in order to get a finite-order model. This dictionary has a significant impact on performance, and requires careful consideration. Theoretical analysis of KLMS as a function of dictionary setting has rarely, if ever, been addressed in the literature. In an analysis previously published by the authors, the dictionary elements were assumed to be governed by the same probability density function as the input data. In this paper, we modify this study by considering the dictionary as part of the filter parameters to be set. This theoretical analysis paves the way for future investigations on KLMS dictionary design.
- This paper presents a hypothesis testing method given independent samples from a number of connected populations. The method is motivated by a forestry project for monitoring change in the strength of lumber. Traditional practice has been built upon nonparametric methods which ignore the fact that these populations are connected. By pooling the information in multiple samples through a density ratio model, the proposed empirical likelihood method leads to more efficient inference and therefore reduces the cost in applications. The new test has a classical chi-square null limiting distribution. Its power function is obtained under a class of local alternatives. The local power is found to increase even when some underlying populations are unrelated to the hypothesis of interest. Simulation studies confirm that this test has better power properties than potential competitors, and is robust to model misspecification. An application example to lumber strength is included.
- Population quantiles and their functions are important parameters in many applications. For example, the lower quantiles often serve as crucial quality indices for forestry products. Given several independent samples from populations satisfying the density ratio model, we investigate the properties of empirical likelihood (EL) based inferences. The induced EL quantile estimators are shown to admit a Bahadur representation that leads to asymptotically valid confidence intervals for functions of quantiles. We rigorously prove that EL quantiles based on all the samples are more efficient than empirical quantiles based on individual samples. A simulation study shows that the EL quantiles and their functions have superior performance when the density ratio model assumption is satisfied and when it is mildly violated. An example is used to demonstrate the new method and the potential cost savings.
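
As a point of reference for the last abstract above, the single-sample empirical quantile that the EL quantiles are compared against can be sketched in a few lines of Python. This is a minimal sketch of the baseline only, not the EL estimator under the density ratio model. The function name `empirical_quantile` and the convention of returning the smallest order statistic whose empirical CDF reaches $p$ are assumptions made for illustration and are not taken from the paper.

```python
import math


def empirical_quantile(sample, p):
    """Return the level-p empirical quantile of a single sample.

    Assumed convention (illustrative, not from the paper): the quantile is
    inf{x : F_n(x) >= p}, i.e. the ceil(n * p)-th order statistic, where
    F_n is the empirical CDF of the sample.
    """
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must be non-empty")
    # Smallest k with k / n >= p; order statistics are 1-indexed.
    k = math.ceil(n * p)
    return xs[k - 1]


if __name__ == "__main__":
    # Hypothetical strength measurements, used only to exercise the function.
    strengths = [4.1, 3.7, 5.2, 4.8, 3.9, 4.4, 5.0, 4.6]
    print(empirical_quantile(strengths, 0.25))
```

Lower quantiles such as $p = 0.05$ or $p = 0.25$ are the kind of quality index the abstract mentions. With small samples this estimator rests on only a few order statistics, which is the motivation for pooling information across samples.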