# Statistics (stat)

• This paper presents an estimator for semiparametric models that uses a feed-forward neural network to fit the nonparametric component. Unlike many methodologies from the machine learning literature, this approach is suitable for longitudinal/panel data. It provides unbiased estimation of the parametric component of the model, with associated confidence intervals that have near-nominal coverage rates. It is further shown that this model and estimator nests a nonparametric heterogeneous treatment effects model and estimator, which can consistently estimate individualized treatment effects conditional on covariates. Simulations demonstrate (1) efficiency, (2) that parametric estimates are unbiased, and (3) coverage properties of estimated intervals. An application section demonstrates the method by predicting county-level corn yield using daily weather data from the period 1981-2015, along with parametric time trends representing technological change. The method is shown to out-perform linear methods such as OLS and ridge/lasso, as well as random forest. The procedures described in this paper are implemented in the R package panelNNET.
• Feb 22 2017 cs.LG stat.ML arXiv:1702.06295v1
Initialization of parameters in deep neural networks has been shown to have a big impact on the performance of the networks (Mishkin & Matas, 2015). The initialization scheme devised by He et al. allowed convolution activations to carry a constrained mean, which allowed deep networks to be trained effectively (He et al., 2015a). Orthogonal initializations, and more generally orthogonal matrices in standard recurrent networks, have been proved to eradicate the vanishing and exploding gradient problem (Pascanu et al., 2012). The majority of current initialization schemes do not take fully into account the intrinsic structure of the convolution operator. This paper introduces a new type of initialization built around the duality of the Fourier transform and the convolution operator. With Convolution Aware Initialization we noticed not only higher accuracy and lower loss, but faster convergence in general. We achieve a new state of the art on the CIFAR10 dataset, and achieve close to state of the art on various other tasks.
• We obtain the first polynomial-time algorithm for exact tensor completion that improves over the bound implied by reduction to matrix completion. The algorithm recovers an unknown 3-tensor with $r$ incoherent, orthogonal components in $\mathbb R^n$ from $r\cdot \tilde O(n^{1.5})$ randomly observed entries of the tensor. This bound improves over the previous best one of $r\cdot \tilde O(n^{2})$ by reduction to exact matrix completion. Our bound also matches the best known results for the easier problem of approximate tensor completion (Barak & Moitra, 2015). Our algorithm and analysis extend seminal results for exact matrix completion (Candes & Recht, 2009) to the tensor setting via the sum-of-squares method. The main technical challenge is to show that a small number of randomly chosen monomials are enough to construct a degree-3 polynomial with precisely planted orthogonal global optima over the sphere and that this fact can be certified within the sum-of-squares proof system.
• This paper addresses tracking of a moving target in a multi-agent network. The target follows linear dynamics corrupted by an adversarial noise, i.e., the noise is not generated from a statistical distribution. The location of the target at each time induces a global time-varying loss function, and the global loss is a sum of local losses, each of which is associated to one agent. Agents' noisy observations could be nonlinear. We formulate this problem as a distributed online optimization where agents communicate with each other to track the minimizer of the global loss. We then propose a decentralized version of the Mirror Descent algorithm and provide the non-asymptotic analysis of the problem. Using the notion of dynamic regret, we measure the performance of our algorithm versus its offline counterpart in the centralized setting. We prove that the bound on dynamic regret scales inversely in the network spectral gap, and it represents the adversarial noise causing deviation with respect to the linear dynamics. Our result subsumes a number of results in the distributed optimization literature. Finally, in a numerical experiment, we verify that our algorithm can be simply implemented for multi-agent tracking with nonlinear observations.
• Hypothesis generation is becoming a crucial time-saving technique which allows biomedical researchers to quickly discover implicit connections between important concepts. Typically, these systems operate on domain-specific fractions of public medical data. MOLIERE, in contrast, utilizes information from over 24.5 million documents. At the heart of our approach lies a multi-modal and multi-relational network of biomedical objects extracted from several heterogeneous datasets from the National Center for Biotechnology Information (NCBI). These objects include but are not limited to scientific papers, keywords, genes, proteins, diseases, and diagnoses. We model hypotheses using Latent Dirichlet Allocation applied on abstracts found near shortest paths discovered within this network, and demonstrate the effectiveness of MOLIERE by performing hypothesis generation on historical data. Our network, implementation, and resulting data are all publicly available for the broad scientific community.
• Boolean matrix factorisation (BooMF) infers interpretable decompositions of a binary data matrix into a pair of low-rank, binary matrices: One containing meaningful patterns, the other quantifying how the observations can be expressed as a combination of these patterns. We introduce the OrMachine, a probabilistic generative model for BooMF and derive a Metropolised Gibbs sampler that facilitates very efficient parallel posterior inference. Our method outperforms all currently existing approaches for Boolean Matrix factorization and completion, as we show on simulated and real world data. This is the first method to provide full posterior inference for BooMF which is relevant in applications, e.g. for controlling false positive rates in collaborative filtering, and crucially it improves the interpretability of the inferred patterns. The proposed algorithm scales to large datasets as we demonstrate by analysing single cell gene expression data in 1.3 million mouse brain cells across 11,000 genes on commodity hardware.
• We study the problem of low-rank plus sparse matrix recovery. We propose a generic and efficient nonconvex optimization algorithm based on projected gradient descent and double thresholding operator, with much lower computational complexity. Compared with existing convex-relaxation based methods, the proposed algorithm recovers the low-rank plus sparse matrices for free, without incurring any additional statistical cost. It not only enables exact recovery of the unknown low-rank and sparse matrices in the noiseless setting, and achieves minimax optimal statistical error rate in the noisy case, but also matches the best-known robustness guarantee (i.e., tolerance for sparse corruption). At the core of our theory is a novel structural Lipschitz gradient condition for low-rank plus sparse matrices, which is essential for proving the linear convergence rate of our algorithm, and we believe is of independent interest to prove fast rates for general superposition-structured models. We demonstrate the superiority of our generic algorithm, both theoretically and experimentally, through three concrete applications: robust matrix sensing, robust PCA and one-bit matrix decomposition.
• A number of fundamental quantities in statistical signal processing and information theory can be expressed as integral functions of two probability density functions. Such quantities are called density functionals as they map density functions onto the real line. For example, information divergence functions measure the dissimilarity between two probability density functions and are particularly useful in a number of applications. Typically, estimating these quantities requires complete knowledge of the underlying distribution followed by multi-dimensional integration. Existing methods make parametric assumptions about the data distribution or use non-parametric density estimation followed by high-dimensional integration. In this paper, we propose a new alternative. We introduce the concept of "data-driven" basis functions - functions of distributions whose value we can estimate given only samples from the underlying distributions without requiring distribution fitting or direct integration. We derive a new data-driven complete basis that is similar to the deterministic Bernstein polynomial basis and develop two methods for performing basis expansions of functionals of two distributions. We also show that the new basis set allows us to approximate functions of distributions as closely as desired. Finally, we evaluate the methodology by developing data driven estimators for the Kullback-Leibler divergences and the Hellinger distance and by constructing tight data-driven bounds on the Bayes Error Rate.
• Principal component analysis (PCA) is fundamental to statistical machine learning. It extracts latent principal factors that contribute to the most variation of the data. When data are stored across multiple machines, however, communication cost can prohibit the computation of PCA in a central location and distributed algorithms for PCA are thus needed. This paper proposes and studies a distributed PCA algorithm: each node machine computes the top $K$ eigenvectors and transmits them to the central server; the central server then aggregates the information from all the node machines and conducts a PCA based on the aggregated information. We investigate the bias and variance for the resulting distributed estimator of the top $K$ eigenvectors. In particular, we show that for distributions with symmetric innovation, the distributed PCA is "unbiased". We derive the rate of convergence for distributed PCA estimators, which depends explicitly on the effective rank of covariance, eigen-gap, and the number of machines. We show that when the number of machines is not unreasonably large, the distributed PCA performs as well as the whole sample PCA, even without full access of whole data. The theoretical results are verified by an extensive simulation study. We also extend our analysis to the heterogeneous case where the population covariance matrices are different across local machines but share similar top eigen-structures.
• Two of the most fundamental prototypes of greedy optimization are the matching pursuit and Frank-Wolfe algorithms. In this paper, we take a unified view on both classes of methods, leading to the first explicit convergence rates of matching pursuit methods in an optimization sense, for general sets of atoms. We derive sublinear ($1/t$) convergence for both classes on general smooth objectives, and linear convergence on strongly convex objectives, as well as a clear correspondence of algorithm variants. Our presented algorithms and rates are affine invariant, and do not need any incoherence or sparsity assumptions.
• We study a spectral initialization method that serves as a key ingredient in recent work on using efficient iterative algorithms for estimating signals in nonconvex settings. Unlike previous analysis in the literature, which is restricted to the phase retrieval setting and which provides only performance bounds, we consider arbitrary generalized linear sensing models and present a precise asymptotic characterization of the performance of the spectral method in the high-dimensional regime. Our analysis reveals a phase transition phenomenon that depends on the sampling ratio. When the ratio is below a minimum threshold, the estimates given by the spectral method are no better than a random guess drawn uniformly from the hypersphere; above a maximum threshold, however, the estimates become increasingly aligned with the target signal. The computational complexity of the spectral method is also markedly different in the two phases. Worked examples and numerical results are provided to illustrate and verify the analytical predictions. In particular, simulations show that our asymptotic formulas provide accurate predictions even at moderate signal dimensions.
• We consider the minimization of composite objective functions composed of the expectation of quadratic functions and an arbitrary convex function. We study the stochastic dual averaging algorithm with a constant step-size, showing that it leads to a convergence rate of O(1/n) without strong convexity assumptions. This thus extends earlier results on least-squares regression with the Euclidean geometry to (a) all convex regularizers and constraints, and (b) all geometries represented by a Bregman divergence. This is achieved by a new proof technique that relates stochastic and deterministic recursions.
• The R package frailtySurv for simulating and fitting semi-parametric shared frailty models is introduced. frailtySurv implements semi-parametric consistent estimators for a variety of frailty distributions, including gamma, log-normal, inverse Gaussian and power variance function, and provides consistent estimators of the standard errors of the parameters' estimators. The parameters' estimators are asymptotically normally distributed, and therefore statistical inference based on the results of this package, such as hypothesis testing and confidence intervals, can be performed using the normal distribution. Extensive simulations demonstrate the flexibility and correct implementation of the estimator. Two case studies performed with publicly-available datasets demonstrate applicability of the package. In the Diabetic Retinopathy Study, the onset of blindness is clustered by patient, and in a large hard drive failure dataset, failure times are thought to be clustered by the hard drive manufacturer and model.
• Given data over the joint distribution of two univariate or multivariate random variables $X$ and $Y$ of mixed or single type data, we consider the problem of inferring the most likely causal direction between $X$ and $Y$. We take an information theoretic approach, from which it follows that first describing the data over cause and then that of effect given cause is shorter than the reverse direction. For practical inference, we propose a score for causal models for mixed type data based on the Minimum Description Length (MDL) principle. In particular, we model dependencies between $X$ and $Y$ using classification and regression trees. Inferring the optimal model is NP-hard, and hence we propose Crack, a fast greedy algorithm to infer the most likely causal direction directly from the data. Empirical evaluation on synthetic, benchmark, and real world data shows that Crack reliably and with high accuracy infers the correct causal direction on both univariate and multivariate cause-effect pairs over both single and mixed type data.
• We propose an inlier-based outlier detection method capable of both identifying the outliers and explaining why they are outliers, by identifying the outlier-specific features. Specifically, we employ an inlier-based outlier detection criterion, which uses the ratio of inlier and test probability densities as a measure of plausibility of being an outlier. For estimating the density ratio function, we propose a localized logistic regression algorithm. Thanks to the locality of the model, variable selection can be outlier-specific, and will help interpret why points are outliers in a high-dimensional space. Through synthetic experiments, we show that the proposed algorithm can successfully detect the important features for outliers. Moreover, we show that the proposed algorithm tends to outperform existing algorithms in benchmark datasets.
• We study the problem of online learning in a class of Markov decision processes known as linearly solvable MDPs. In the stationary version of this problem, a learner interacts with its environment by directly controlling the state transitions, attempting to balance a fixed state-dependent cost and a certain smooth cost penalizing extreme control inputs. In the current paper, we consider an online setting where the state costs may change arbitrarily between consecutive rounds, and the learner only observes the costs at the end of each respective round. We are interested in constructing algorithms for the learner that guarantee small regret against the best stationary control policy chosen in full knowledge of the cost sequence. Our main result is showing that the smoothness of the control cost enables the simple algorithm of following the leader to achieve a regret of order $\log^2 T$ after $T$ rounds, vastly improving on the best known regret bound of order $T^{3/4}$ for this setting.
• In this note we answer a question of G. Lecué, by showing that column normalization of a random matrix with iid entries need not lead to good sparse recovery properties, even if the generating random variable has a reasonable moment growth. Specifically, for every $2 \leq p \leq c_1\log d$ we construct a random vector $X \in \mathbb{R}^d$ with iid, mean-zero, variance $1$ coordinates, that satisfies $\sup_{t \in S^{d-1}} \|\langle X,t\rangle\|_{L_q} \leq c_2\sqrt{q}$ for every $2\leq q \leq p$. We show that if $m \leq c_3\sqrt{p}d^{1/p}$ and $\tilde{\Gamma}:\mathbb{R}^d \to \mathbb{R}^m$ is the column-normalized matrix generated by $m$ independent copies of $X$, then with probability at least $1-2\exp(-c_4m)$, $\tilde{\Gamma}$ does not satisfy the exact reconstruction property of order $2$.
• This paper provides asymptotic theory for the Inverse Probability Weighting (IPW) and Locally Robust Estimator (LRE) of the Best Linear Predictor where the response is missing at random (MAR), but not completely at random (MCAR). We relax previous assumptions in the literature about the first-step nonparametric components, requiring only their mean square convergence. This relaxation allows the use of a wider class of machine learning methods for the first step, such as lasso. For a generic first step, IPW incurs a first-order bias unless the model it approximates is truly linear in the predictors. In contrast, LRE remains first-order unbiased provided one can estimate the conditional expectation of the response with sufficient accuracy. An additional novelty is allowing the dimension of the Best Linear Predictor to grow with sample size. These relaxations are important for estimation of the best linear predictor of teacher-specific and hospital-specific effects with a large number of individuals.
• Many statistical learning problems can be posed as minimization of the sum of two convex functions, one typically non-smooth. Popular algorithms for solving such problems, e.g., ADMM, often involve non-trivial optimization subproblems or smoothing approximation. We study two classes of algorithms that do not incur these difficulties, and unify them from a perspective of monotone operator theory. The result is a class of preconditioned forward-backward algorithms with a novel family of preconditioners. We analyze convergence of the whole class of algorithms, and obtain their rates of convergence for the range of algorithm parameters where convergence is known but rates have been missing. We demonstrate the scalability of our algorithm class with a distributed implementation.
• Eigenvector spatial filtering (ESF) is a spatial modeling approach, which has been applied in urban and regional studies, ecological studies, and so on. However, it is computationally demanding, and may not be suitable for large data modeling. The objective of this study is to develop fast ESF and random effects ESF (RE-ESF), which are capable of handling very large samples. To achieve this, we accelerate eigen-decomposition and parameter estimation, which make ESF and RE-ESF slow. The former is accelerated by utilizing the Nyström extension, whereas the latter is accelerated by small matrix tricks. The resulting fast ESF and fast RE-ESF are compared with non-approximated ESF and RE-ESF in Monte Carlo simulation experiments. The result shows that, while ESF and RE-ESF are slow for sample sizes of several thousand, fast ESF and RE-ESF require only several minutes even for a sample size of 500,000. It is also verified that their approximation errors are very small. We subsequently apply fast ESF and RE-ESF approaches to a land price analysis.
• Consider the problem of modeling hysteresis for finite-state random walks using higher-order Markov chains. This Letter introduces a Bayesian framework to determine, from data, the number of prior states of recent history upon which a trajectory is statistically dependent. The general recommendation is to use leave-one-out cross validation, using an easily-computable formula that is provided in closed form. Importantly, Bayes factors using flat model priors are biased in favor of too-complex a model (more hysteresis) when a large amount of data is present and the Akaike information criterion (AIC) is biased in favor of too-sparse a model (less hysteresis) when few data are present.
• Since the events of the Arab Spring, there has been increased interest in using social media to anticipate social unrest. While efforts have been made toward automated unrest prediction, we focus on filtering the vast volume of tweets to identify tweets relevant to unrest, which can be provided to downstream users for further analysis. We train a supervised classifier that is able to label Arabic language tweets as relevant to unrest with high reliability. We examine the relationship between training data size and performance and investigate ways to optimize the model building process while minimizing cost. We also explore how confidence thresholds can be set to achieve desired levels of performance.
• Hypothesis tests in models whose dimension far exceeds the sample size can be formulated much like the classical studentized tests only after the initial bias of estimation is removed successfully. The theory of debiased estimators can be developed in the context of quantile regression models for a fixed quantile value. However, it is frequently desirable to formulate tests based on the quantile regression process, as this leads to more robust tests and more stable confidence sets. Additionally, inference in quantile regression requires estimation of the so-called sparsity function, which depends on the unknown density of the error. In this paper we consider a debiasing approach for the uniform testing problem. We develop high-dimensional regression rank scores and show how to use them to estimate the sparsity function, as well as how to adapt them for inference involving the quantile regression process. Furthermore, we develop a Kolmogorov-Smirnov test in location-shift high-dimensional models and confidence sets that are uniformly valid for many quantile values. The main technical results are the development of a Bahadur representation of the debiasing estimator that is uniform over a range of quantiles and uniform convergence of the quantile process to the Brownian bridge process, which are of independent interest. Simulation studies illustrate finite sample properties of our procedure.
• This paper concerns the problem of recovering an unknown but structured signal $x \in \mathbb{R}^n$ from $m$ quadratic measurements of the form $y_r=|\langle a_r,x\rangle|^2$ for $r=1,2,\ldots,m$. We focus on the under-determined setting where the number of measurements is significantly smaller than the dimension of the signal ($m \ll n$). We formulate the recovery problem as a nonconvex optimization problem where prior structural information about the signal is enforced through constraints on the optimization variables. We prove that projected gradient descent, when initialized in a neighborhood of the desired signal, converges to the unknown signal at a linear rate. These results hold for any constraint set (convex or nonconvex), providing convergence guarantees to the global optimum even when the objective function and constraint set are nonconvex. Furthermore, these results hold with a number of measurements that is only a constant factor away from the minimal number of measurements required to uniquely identify the unknown signal. Our results provide the first provably tractable algorithm for this data-poor regime, breaking local sample complexity barriers that have emerged in recent literature. In a companion paper we demonstrate favorable properties for the optimization problem that may enable similar results to continue to hold more globally (over the entire ambient space). Collectively these two papers utilize and develop powerful tools for uniform convergence of empirical processes that may have broader implications for rigorous understanding of constrained nonconvex optimization heuristics. The mathematical results in this paper also pave the way for a new generation of data-driven phase-less imaging systems that can utilize prior information to significantly reduce acquisition time and enhance image reconstruction, enabling nano-scale imaging at unprecedented speeds and resolutions.
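As a rough illustration of the projected gradient descent scheme described in the last abstract above, the sketch below recovers a sparse signal from quadratic measurements $y_r = \langle a_r, x\rangle^2$ with $m < n$. Several pieces are assumptions made for illustration and are not taken from the paper: the least-squares loss $f(z) = \frac{1}{4m}\sum_r (\langle a_r, z\rangle^2 - y_r)^2$, the choice of an $s$-sparse constraint set with hard thresholding as the projection, the step-size rule, the Gaussian measurement vectors, and all function names. The starting point is placed near the true signal by construction, mirroring the abstract's local-initialization assumption.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def project_sparse(z, s):
    """Projection onto s-sparse vectors: keep the s largest-magnitude entries."""
    order = sorted(range(len(z)), key=lambda i: abs(z[i]), reverse=True)
    keep = set(order[:s])
    return [z[i] if i in keep else 0.0 for i in range(len(z))]


def gradient(z, A, y):
    """Gradient of the assumed loss f(z) = 1/(4m) * sum_r (<a_r, z>^2 - y_r)^2."""
    m = len(A)
    g = [0.0] * len(z)
    for a, y_r in zip(A, y):
        inner = dot(a, z)
        coeff = (inner * inner - y_r) * inner / m
        for i, a_i in enumerate(a):
            g[i] += coeff * a_i
    return g


def distance_up_to_sign(z, x):
    """Quadratic measurements cannot distinguish x from -x, so compare to both."""
    plus = sum((zi - xi) ** 2 for zi, xi in zip(z, x)) ** 0.5
    minus = sum((zi + xi) ** 2 for zi, xi in zip(z, x)) ** 0.5
    return min(plus, minus)


def run_demo(n=40, m=25, iters=500, seed=0):
    """Return (initial error, final error) for a hypothetical 3-sparse signal."""
    rng = random.Random(seed)
    s = 3
    x = [0.0] * n
    x[3], x[17], x[30] = 1.0, -1.0, 1.0  # hypothetical ground-truth signal

    # Gaussian measurement vectors (an assumption) and quadratic measurements.
    A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
    y = [dot(a, x) ** 2 for a in A]

    # Start in a neighborhood of x; a practical method would need its own initializer.
    z = project_sparse([xi + 0.05 * rng.gauss(0.0, 1.0) for xi in x], s)
    start_error = distance_up_to_sign(z, x)

    # Heuristic step size: mean(y) estimates ||x||^2 for standard Gaussian a_r.
    step = 0.03 / (sum(y) / m)
    for _ in range(iters):
        g = gradient(z, A, y)
        z = project_sparse([zi - step * gi for zi, gi in zip(z, g)], s)
    return start_error, distance_up_to_sign(z, x)


if __name__ == "__main__":
    print(run_demo())
```

Swapping `project_sparse` for a different projection (for example onto a norm ball) changes the constraint set without touching the rest of the loop, which is the sense in which the scheme is generic over constraint sets.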
