There is a lack of methodological results to design efficient Markov chain Monte Carlo (MCMC) algorithms for statistical models with discrete-valued high-dimensional parameters. Motivated by this consideration, we propose a simple framework for the design of informed MCMC proposals (i.e. Metropolis-Hastings proposal distributions that appropriately incorporate local information about the target) which is naturally applicable to both discrete and continuous spaces. We explicitly characterize the class of optimal proposal distributions under this framework, which we refer to as locally-balanced proposals, and prove their Peskun-optimality in high-dimensional regimes. The resulting algorithms are straightforward to implement in discrete spaces and provide orders of magnitude improvements in efficiency compared to alternative MCMC schemes, including discrete versions of Hamiltonian Monte Carlo. Simulations are performed with both simulated and real datasets, including a detailed application to Bayesian record linkage. A direct connection with gradient-based MCMC suggests that locally-balanced proposals may be seen as a natural way to extend the latter to discrete spaces.
Nov 21 2017 stat.CO
In this work we present a non-reversible, tuning- and rejection-free Markov chain Monte Carlo which naturally fits in the framework of hit-and-run. The sampler only requires access to the gradient of the log-density function, hence the normalizing constant is not needed. We prove the proposed Markov chain is invariant for the target distribution and illustrate its applicability through a wide range of examples. We show that the sampler introduced in the present paper is intimately related to the continuous sampler of Peters and de With (2012), Bouchard-Cote et al. (2017). In particular, the computation is quite similar in the sense that both are centered around simulating an inhomogenuous Poisson process. The computation can be simplified when the gradient of the log-density admits a computationally efficient directional decomposition into a sum of two monotone functions. We apply our sampler in selective inference, gaining significant improvement over the formerly used sampler (Tian et al. 2016).
Quadratic approximations of logistic log-likelihoods are fundamental to facilitate estimation and inference for binary variables. Although classical expansions underlying Newton-Raphson and Fisher scoring methods have attracted much of the interest, there has been also a recent focus on quadratic bounds that uniformly minorize the logistic log-likelihood, and are tangent to it in a specific point. Compared to the classical Taylor expansion of the score function, these approximations provide iterative estimation procedures which guarantee monotonicity in the log-likelihood sequence, and motivate variational methods for Bayesian inference. A relevant contribution, within this class of approximations, relies on a convex duality argument to derive a tractable family of tangent quadratic expansions indexed by a location parameter. Although this approximation is widely used in practice, less attempts have been made to understand its probabilistic justification and the associated properties. To address this gap, we formally relate this quadratic lower bound to a recent Pólya-gamma data augmentation, showing that the approximation error associated with the bound coincides with the Kullback-Leibler divergence between a generic Pólya-gamma variable and the one obtained by conditioning on the observed response data. This result facilitates the study of the optimality properties associated with the minorize-majorize and variational Bayes routines leveraging this quadratic bound, and motivates a novel mean-field variational Bayes for logistic regression.
Distributed algorithms are often beset by the straggler effect, where the slowest compute nodes in the system dictate the overall running time. Coding-theoretic techniques have been recently proposed to mitigate stragglers via algorithmic redundancy. Prior work in coded computation and gradient coding has mainly focused on exact recovery of the desired output. However, slightly inexact solutions can be acceptable in applications that are robust to noise, such as model training via gradient-based algorithms. In this work, we present computationally simple gradient codes based on sparse graphs that guarantee fast and approximately accurate distributed computation. We demonstrate that sacrificing a small amount of accuracy can significantly increase algorithmic robustness to stragglers.
Markov Chain Monte Carlo (MCMC) methods such as Gibbs sampling are finding widespread use in applied statistics and machine learning. These often lead to difficult computational problems, which are increasingly being solved on parallel and distributed systems such as compute clusters. Recent work has proposed running iterative algorithms such as gradient descent and MCMC in parallel asynchronously for increased performance, with good empirical results in certain problems. Unfortunately, for MCMC this parallelization technique requires new convergence theory, as it has been explicitly demonstrated to lead to divergence on some examples. Recent theory on Asynchronous Gibbs sampling describes why these algorithms can fail, and provides a way to alter them to make them converge. In this article, we describe how to apply this theory in a generic setting, to understand the asynchronous behavior of any MCMC algorithm, including those implemented using parameter servers, and those not based on Gibbs sampling.
Nov 21 2017 stat.CO
Genetic algorithms are a widely used method in chemometrics for extracting variable subsets with high prediction power. Most fitness measures used by these genetic algorithms are based on the ordinary least-squares fit of the resulting model to the entire data or a subset thereof. Due to multicollinearity, partial least squares regression is often more appropriate, but rarely considered in genetic algorithms due to the additional cost for estimating the optimal number of components. We introduce two novel fitness measures for genetic algorithms, explicitly designed to estimate the internal prediction performance of partial least squares regression models built from the variable subsets. Both measures estimate the optimal number of components using cross-validation and subsequently estimate the prediction performance by predicting the response of observations not included in model-fitting. This is repeated multiple times to estimate the measures' variations due to different random splits. Moreover, one measure was optimized for speed and more accurate estimation of the prediction performance for observations not included during variable selection. This leads to variable subsets with high internal and external prediction power. Results on high-dimensional chemical-analytical data show that the variable subsets acquired by this approach have competitive internal prediction power and superior external prediction power compared to variable subsets extracted with other fitness measures.