Methodology (stat.ME)

  • PDF
    Boolean matrix factorisation (BooMF) infers interpretable decompositions of a binary data matrix into a pair of low-rank, binary matrices: one containing meaningful patterns, the other quantifying how the observations can be expressed as a combination of these patterns. We introduce the OrMachine, a probabilistic generative model for BooMF, and derive a Metropolised Gibbs sampler that facilitates very efficient parallel posterior inference. Our method outperforms all currently existing approaches for Boolean matrix factorisation and completion, as we show on simulated and real-world data. This is the first method to provide full posterior inference for BooMF, which is relevant in applications, e.g. for controlling false positive rates in collaborative filtering, and crucially it improves the interpretability of the inferred patterns. The proposed algorithm scales to large datasets, as we demonstrate by analysing single-cell gene expression data in 1.3 million mouse brain cells across 11,000 genes on commodity hardware.
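As a minimal sketch of the decomposition described above, the snippet below computes a Boolean matrix product, in which an entry is 1 if at least one pattern assigned to the observation switches that feature on. The function name `boolean_product` and the variable names `z` (pattern assignments), `u` (patterns) and `x` (data) are illustrative assumptions, not taken from the paper, and the sampler itself is not shown.

```python
def boolean_product(z, u):
    """Boolean matrix product: x[n][d] = OR over l of (z[n][l] AND u[l][d]).

    z: N x L binary matrix (which patterns each observation uses; assumed name)
    u: L x D binary matrix (the patterns themselves; assumed name)
    """
    n_latent = len(u)
    n_features = len(u[0])
    return [
        [int(any(row[l] and u[l][d] for l in range(n_latent)))
         for d in range(n_features)]
        for row in z
    ]


# Two hypothetical patterns over four features.
u = [[1, 1, 0, 0],
     [0, 1, 1, 0]]

# Four observations: pattern 0 only, pattern 1 only, both, neither.
z = [[1, 0],
     [0, 1],
     [1, 1],
     [0, 0]]

x = boolean_product(z, u)
for row in x:
    print(row)
```

In the third observation both patterns switch on the second feature, and the result is still 1 rather than 2. This is what distinguishes the Boolean product from the ordinary matrix product.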
  • PDF
    This paper provides asymptotic theory for the Inverse Probability Weighting (IPW) estimator and the Locally Robust Estimator (LRE) of the Best Linear Predictor where the response is missing at random (MAR), but not completely at random (MCAR). We relax previous assumptions in the literature about the first-step nonparametric components, requiring only their mean square convergence. This relaxation allows the use of a wider class of machine learning methods for the first step, such as lasso. For a generic first step, IPW incurs a first-order bias unless the model it approximates is truly linear in the predictors. In contrast, LRE remains first-order unbiased provided one can estimate the conditional expectation of the response with sufficient accuracy. An additional novelty is allowing the dimension of the Best Linear Predictor to grow with the sample size. These relaxations are important for estimating the best linear predictor of teacher-specific and hospital-specific effects with a large number of individuals.
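For orientation, the two estimators can be contrasted through the textbook moment conditions below. This is a sketch under assumed notation, not necessarily the paper's exact formulation: $D$ is the indicator that the response $Y$ is observed, $X$ are the predictors, $W$ are the covariates (containing $X$) given which the response is missing at random, $p(W) = P(D = 1 \mid W)$ and $\mu(W) = E[Y \mid W]$.

```latex
% Target: the best linear predictor coefficient \beta solves
E\!\left[ X \,(Y - X^{\top}\beta) \right] = 0 .

% IPW moment: reweight the observed cases by the inverse propensity
E\!\left[ \frac{D}{p(W)}\, X \,(Y - X^{\top}\beta) \right] = 0 .

% Locally robust (augmented IPW) moment: add a regression correction term
E\!\left[ X \,\bigl(\mu(W) - X^{\top}\beta\bigr)
        + \frac{D}{p(W)}\, X \,\bigl(Y - \mu(W)\bigr) \right] = 0 .
```

In the third moment, a first-order error in the estimate of $p$ is multiplied by $Y - \mu(W)$, which has conditional mean zero, and a first-order error in the estimate of $\mu$ is multiplied by $1 - D/p(W)$, which also has conditional mean zero. This is the usual explanation for why such moments are insensitive to first-step estimation error.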
  • PDF
    Eigenvector spatial filtering (ESF) is a spatial modeling approach that has been applied in urban and regional studies, ecological studies, and other fields. However, it is computationally demanding and may not be suitable for modeling large data. The objective of this study is to develop fast ESF and random effects ESF (RE-ESF) that are capable of handling very large samples. To achieve this, we accelerate eigen-decomposition and parameter estimation, the two steps that make ESF and RE-ESF slow. The former is accelerated by utilizing the Nyström extension, whereas the latter is accelerated by small matrix tricks. The resulting fast ESF and fast RE-ESF are compared with non-approximated ESF and RE-ESF in Monte Carlo simulation experiments. The results show that, while ESF and RE-ESF are slow for sample sizes of several thousand, fast ESF and RE-ESF require only several minutes even for a sample size of 500,000. It is also verified that their approximation errors are very small. We subsequently apply the fast ESF and RE-ESF approaches to a land price analysis.
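The generic Nyström extension mentioned above can be written as follows. This is the standard form for a symmetric positive semi-definite $N \times N$ matrix $C$ approximated from $L \ll N$ anchor points. The symbols are assumptions for illustration, and the centring that ESF applies to its spatial connectivity matrix is omitted.

```latex
% Low-rank approximation from the N x L block C_{NL} and the L x L block C_{LL}
C \approx C_{NL}\, C_{LL}^{-1}\, C_{LN},
\qquad
C_{LL} = E_L \Lambda_L E_L^{\top} .

% Approximate eigenvectors and eigenvalues of the full matrix
\hat{E} = \sqrt{\tfrac{L}{N}}\; C_{NL}\, E_L\, \Lambda_L^{-1},
\qquad
\hat{\Lambda} = \tfrac{N}{L}\, \Lambda_L .

% Consistency check
\hat{E}\, \hat{\Lambda}\, \hat{E}^{\top} = C_{NL}\, C_{LL}^{-1}\, C_{LN} .
```

Only the $L \times L$ block is eigen-decomposed, so the cost of that step no longer grows with $N$.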
  • PDF
    Consider the problem of modeling hysteresis for finite-state random walks using higher-order Markov chains. This Letter introduces a Bayesian framework to determine, from data, the number of prior states of recent history upon which a trajectory is statistically dependent. The general recommendation is to use leave-one-out cross-validation, using an easily computable formula that is provided in closed form. Importantly, Bayes factors using flat model priors are biased in favor of too complex a model (more hysteresis) when a large amount of data is present, and the Akaike information criterion (AIC) is biased in favor of too sparse a model (less hysteresis) when few data are present.
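The idea of scoring candidate Markov orders by leave-one-out prediction can be sketched as below. The code assumes a symmetric Dirichlet prior with concentration `alpha` on each row of the transition matrix and treats the transitions that share a history as exchangeable. The resulting closed form is a common textbook construction and may differ from the formula given in the Letter. The name `loo_log_score` and its parameters are hypothetical.

```python
import math
from collections import Counter


def loo_log_score(seq, order, n_states, alpha=1.0, start=None):
    """Leave-one-out log predictive score of a Markov chain of a given order.

    Each transition (history -> state) is predicted from all other transitions
    with the same history, using a Dirichlet(alpha) posterior predictive:
        p = (n_hs - 1 + alpha) / (n_h - 1 + n_states * alpha)
    `start` fixes the first predicted index so that different orders are
    compared on the same set of observations.
    """
    start = order if start is None else start
    if start < order:
        raise ValueError("start must be at least the order")

    pair_counts = Counter()  # counts of (history, next state)
    hist_counts = Counter()  # counts of history
    for t in range(start, len(seq)):
        hist = tuple(seq[t - order:t])
        pair_counts[(hist, seq[t])] += 1
        hist_counts[hist] += 1

    score = 0.0
    for (hist, _state), n_hs in pair_counts.items():
        n_h = hist_counts[hist]
        prob = (n_hs - 1 + alpha) / (n_h - 1 + n_states * alpha)
        score += n_hs * math.log(prob)
    return score


# A strictly alternating two-state trajectory: one step of memory suffices.
seq = [t % 2 for t in range(20)]
scores = {k: loo_log_score(seq, k, n_states=2, start=2) for k in range(3)}
best_order = max(scores, key=scores.get)
print(scores, best_order)
```

For this trajectory the order-0 model scores much worse than order 1, while order 2 ties with order 1 because the extra history adds no information. `max` returns the first maximum, so the tie resolves to the lower order.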