Jul 25 2017

stat.ME arXiv:1707.07560v1

We consider structure learning of linear Gaussian structural equation models with weak edges. Since the presence of weak edges can lead to a loss of edge orientations in the true underlying CPDAG, we define a new graphical object that can contain more edge orientations. We show that this object can be recovered from observational data under a type of strong faithfulness assumption. We present a new algorithm for this purpose, called aggregated greedy equivalence search (AGES), that aggregates the solution path of the greedy equivalence search (GES) algorithm for varying values of the penalty parameter. We prove consistency of AGES and demonstrate its performance in a simulation study and on single cell data from Sachs et al. (2005). The algorithm will be made available in the R-package pcalg.

Jul 25 2017

stat.ME arXiv:1707.07506v1

In this paper we propose a principal component Liu-type logistic estimator by combining the principal component logistic regression estimator and Liu-type logistic estimator to overcome the multicollinearity problem. The superiority of the new estimator over some related estimators are studied under the asymptotic mean squared error matrix. A Monte Carlo simulation experiment is designed to compare the performances of the estimators using mean squared error criterion. Finally, a conclusion section is presented.

Jul 25 2017

stat.ME arXiv:1707.07332v1

With the rapid growth of modern technology, many large-scale biomedical studies have been/are being/will be conducted to collect massive datasets with large volumes of multi-modality imaging, genetic, neurocognitive, and clinical information from increasingly large cohorts. Simultaneously extracting and integrating rich and diverse heterogeneous information in neuroimaging and/or genomics from these big datasets could transform our understanding of how genetic variants impact brain structure and function, cognitive function, and brain-related disease risk across the lifespan. Such understanding is critical for diagnosis, prevention, and treatment of numerous complex brain-related disorders (e.g., schizophrenia and Alzheimer). However, the development of analytical methods for the joint analysis of both high-dimensional imaging phenotypes and high-dimensional genetic data, called big data squared (BD$^2$), presents major computational and theoretical challenges for existing analytical methods. Besides the high-dimensional nature of BD$^2$, various neuroimaging measures often exhibit strong spatial smoothness and dependence and genetic markers may have a natural dependence structure arising from linkage disequilibrium. We review some recent developments of various statistical techniques for the joint analysis of BD$^2$, including massive univariate and voxel-wise approaches, reduced rank regression, mixture models, and group sparse multi-task regression. By doing so, we hope that this review may encourage others in the statistical community to enter into this new and exciting field of research.

In this paper, we introduce the models of permutations with bias, which are random permutations of a set, biased by some preference values. We present a new parametric test, together with an efficient way to calculate its p-value. The final tables of the English and Spanish major soccer leagues are tested according to this new procedure, to discover whether these results were aligned with expectations.

Jul 25 2017

stat.ME arXiv:1707.07215v1

Multistage design has been used in a wide range of scientific fields. By allocating sensing resources adaptively, one can effectively eliminate null locations and localize signals with a smaller study budget. We formulate a decision-theoretic framework for simultaneous multi- stage adaptive testing and study how to minimize the total number of measurements while meeting pre-specified constraints on both the false positive rate (FPR) and missed discovery rate (MDR). The new procedure, which effectively pools information across individual tests using a simultaneous multistage adaptive ranking and thresholding (SMART) approach, can achieve precise error rates control and lead to great savings in total study costs. Numerical studies confirm the effectiveness of SMART for FPR and MDR control and show that it achieves substantial power gain over existing methods. The SMART procedure is demonstrated through the analysis of high-throughput screening data and spatial imaging data.

Prediction rule ensembles (PREs) are sparse collections of rules, offering highly interpretable regression and classification models. This paper presents the R package pre, which derives PREs through the methodology of Friedman and Popescu (2008). The implementation and functionality of package pre is described and illustrated through application on a dataset on the prediction of depression. Furthermore, accuracy and sparsity of PREs is compared with that of single trees, random forest and lasso regression in four benchmark datasets. Results indicate that pre derives ensembles with predictive accuracy comparable to that of random forests, while using a smaller number of variables for prediction.