Sep 22 2017

stat.ME arXiv:1709.07064v1

The Gaussian process is a standard tool for building emulators for both deterministic and stochastic computer experiments. However, application of Gaussian process models is greatly limited in practice, particularly for the large-scale and many-input computer experiments that have become typical. We propose a multi-resolution functional ANOVA model as a computationally feasible emulation alternative. More generally, this model can be used for large-scale and many-input non-linear regression problems. An overlapping group lasso approach is used for estimation, ensuring computational feasibility in a large-scale and many-input setting. New results on consistency and inference for the (potentially overlapping) group lasso in a high-dimensional setting are developed and applied to the proposed multi-resolution functional ANOVA model. Importantly, these results allow us to quantify the uncertainty in our predictions. Numerical examples demonstrate that the proposed model enjoys marked computational advantages. Data capabilities, in terms of both sample size and dimension, meet or exceed those of the best available emulation tools, while also meeting or exceeding their emulation accuracy.

Sep 22 2017

stat.ME arXiv:1709.07339v1

Fisherian randomization inference is often dismissed as testing an uninteresting and implausible hypothesis: the sharp null of no effects whatsoever. We show that this view is overly narrow. Many randomization tests are also valid under a more general "bounded" null hypothesis under which all effects are weakly negative (or positive), thus accommodating heterogeneous effects. By inverting such tests we can form one-sided confidence intervals for the maximum (or minimum) effect. These properties hold for all effect-increasing test statistics, which include both common statistics such as the mean difference and uncommon ones such as Stephenson rank statistics. The latter's sensitivity to extreme effects permits detection of positive effects even when the average effect is negative. We argue that bounded nulls are often of substantive or theoretical interest, and illustrate with two applications: testing monotonicity in an IV analysis and inferring effect sizes in a small randomized experiment.
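As a rough illustration of the kind of test this abstract discusses, the sketch below computes a one-sided Monte Carlo randomization p-value using the mean difference as the test statistic. It is a minimal sketch, not the authors' code: it assumes a completely randomized design (a fixed number of treated units), and the function names `mean_difference` and `randomization_p_value`, the number of draws, and the seed are all hypothetical choices made for this example. Large values of the statistic count as evidence against the null that every unit-level effect is at most zero.

```python
import random


def mean_difference(outcomes, treated):
    """Mean outcome of treated units minus mean outcome of control units."""
    t = [y for y, z in zip(outcomes, treated) if z]
    c = [y for y, z in zip(outcomes, treated) if not z]
    return sum(t) / len(t) - sum(c) / len(c)


def randomization_p_value(outcomes, treated, draws=10000, seed=0):
    """One-sided Monte Carlo randomization p-value.

    Assumption (not from the abstract): complete randomization, so each
    re-draw is a random permutation of the observed assignment vector.
    Observed outcomes are held fixed, as under the sharp null.
    """
    rng = random.Random(seed)
    observed = mean_difference(outcomes, treated)
    assignment = list(treated)
    hits = 0
    for _ in range(draws):
        rng.shuffle(assignment)
        if mean_difference(outcomes, assignment) >= observed:
            hits += 1
    # The +1 terms count the observed assignment itself, keeping p > 0.
    return (hits + 1) / (draws + 1)


if __name__ == "__main__":
    y = [10, 11, 12, 13, 1, 2, 3, 4]
    z = [1, 1, 1, 1, 0, 0, 0, 0]
    print(randomization_p_value(y, z))
```

Swapping in a different effect-increasing statistic only requires replacing `mean_difference`; the permutation loop is unchanged.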

Sep 22 2017

stat.ME arXiv:1709.07238v1

Factors are categorical variables, and the values which these variables assume are called levels. In this paper, we consider the variable selection problem where the set of potential predictors contains both factors and numerical variables. Formally, this problem is a particular case of the standard variable selection problem where factors are coded using dummy variables. As such, the Bayesian solution would be straightforward and, possibly because of this, the problem, despite its importance, has not received much attention in the literature. Nevertheless, we show that this perception is illusory and that in fact several inputs like the assignment of prior probabilities over the model space or the parameterization adopted for factors may have a large (and difficult to anticipate) impact on the results. We provide a solution to these issues that extends the proposals in the standard variable selection problem and does not depend on how the factors are coded using dummy variables. Our approach is illustrated with a real example concerning a childhood obesity study in Spain.

Sep 22 2017

stat.ME arXiv:1709.07045v1

The Minimum Covariance Determinant (MCD) method is a highly robust estimator of multivariate location and scatter, for which a fast algorithm is available. Since estimating the covariance matrix is the cornerstone of many multivariate statistical methods, the MCD is an important building block when developing robust multivariate techniques. It also serves as a convenient and efficient tool for outlier detection. The MCD estimator is reviewed, along with its main properties such as affine equivariance, breakdown value, and influence function. We discuss its computation, and list applications and extensions of the MCD in applied and methodological multivariate statistics. Two recent extensions of the MCD are described. The first one is a fast deterministic algorithm which inherits the robustness of the MCD while being almost affine equivariant. The second is tailored to high-dimensional data, possibly with more dimensions than cases, and incorporates regularization to prevent singular matrices.
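To make the definition of the raw MCD concrete, the sketch below searches every subset of h points in a tiny two-dimensional data set and keeps the one whose sample covariance matrix has the smallest determinant. This exhaustive search is only feasible for very small samples and is not the fast algorithm the abstract refers to. The function names `mean_and_cov` and `exact_mcd` and the toy data are hypothetical, and no consistency or finite-sample correction factor is applied to the scatter estimate.

```python
from itertools import combinations


def mean_and_cov(points):
    """Sample mean and covariance entries (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def exact_mcd(points, h):
    """Brute-force raw MCD: the h-subset with the smallest covariance determinant.

    Returns (determinant, center, covariance entries, subset).
    """
    best = None
    for subset in combinations(points, h):
        center, (sxx, sxy, syy) = mean_and_cov(subset)
        det = sxx * syy - sxy ** 2
        if best is None or det < best[0]:
            best = (det, center, (sxx, sxy, syy), subset)
    return best


if __name__ == "__main__":
    # Six clustered points plus two gross outliers (toy data for illustration).
    data = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.2), (0.3, 0.8),
            (10, 10), (12, -9)]
    det, center, cov, subset = exact_mcd(data, h=6)
    print(center, det)
```

On this toy data the selected subset leaves out both outliers, so the location estimate stays inside the main cluster, whereas the ordinary sample mean would be pulled toward the outliers.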

We develop a new modeling framework for Inter-Subject Analysis (ISA). The goal of ISA is to explore the dependency structure between different subjects, treating the intra-subject dependency as a nuisance. It has important applications in neuroscience, where it is used to explore the functional connectivity between brain regions under natural stimuli. Our framework is based on Gaussian graphical models, under which ISA can be converted to the problem of estimation and inference of the inter-subject precision matrix. The main statistical challenge is that we do not impose a sparsity constraint on the whole precision matrix; we only assume that the inter-subject part is sparse. For estimation, we propose to estimate an alternative parameter to get around the non-sparsity issue, and the resulting estimator achieves asymptotic consistency even if the intra-subject dependency is dense. For inference, we propose an "untangle and chord" procedure to de-bias our estimator. It is valid without the sparsity assumption on the inverse Hessian of the log-likelihood function. This inferential method is general and can be applied to many other statistical problems, so it is of independent theoretical interest. Numerical experiments on both simulated and brain imaging data validate our methods and theory.