# Methodology (stat.ME)

• As a competitive alternative to least squares regression, quantile regression is popular in analyzing heterogenous data. For quantile regression model specified for one single quantile level $\tau$, major difficulties of semiparametric efficient estimation are the unavailability of a parametric efficient score and the conditional density estimation. In this paper, with the help of the least favorable submodel technique, we first derive the semiparametric efficient scores for linear quantile regression models that are assumed for a single quantile level, multiple quantile levels and all the quantile levels in $(0,1)$ respectively. Our main discovery is a one-step (nearly) semiparametric efficient estimation for the regression coefficients of the quantile regression models assumed for multiple quantile levels, which has several advantages: it could be regarded as an optimal way to pool information across multiple/other quantiles for efficiency gain; it is computationally feasible and easy to implement, as the initial estimator is easily available; due to the nature of quantile regression models under investigation, the conditional density estimation is straightforward by plugging in an initial estimator. The resulting estimator is proved to achieve the corresponding semiparametric efficiency lower bound under regularity conditions. Numerical studies including simulations and an example of birth weight of children confirms that the proposed estimator leads to higher efficiency compared with the Koenker-Bassett quantile regression estimator for all quantiles of interest.
• Many data producers seek to provide users access to confidential data without unduly compromising data subjects' privacy and confidentiality. When intense redaction is needed to do so, one general strategy is to require users to do analyses without seeing the confidential data, for example, by releasing fully synthetic data or by allowing users to query remote systems for disclosure-protected outputs of statistical models. With fully synthetic data or redacted outputs, the analyst never really knows how much to trust the resulting findings. In particular, if the user did the same analysis on the confidential data, would regression coefficients of interest be statistically significant or not? We present algorithms for assessing this question that satisfy differential privacy. We describe conditions under which the algorithms should give accurate answers about statistical significance. We illustrate the properties of the methods using artificial and genuine data.
• The Bonferroni adjustment, or the union bound, is commonly used to study rate optimality properties of statistical methods in high-dimensional problems. However, in practice, the Bonferroni adjustment is overly conservative. The extreme value theory has been proven to provide more accurate multiplicity adjustments in a number of settings, but only on ad hoc basis. Recently, Gaussian approximation has been used to justify bootstrap adjustments in large scale simultaneous inference in some general settings when $n \gg (\log p)^7$, where $p$ is the multiplicity of the inference problem and $n$ is the sample size. The thrust of this theory is the validity of the Gaussian approximation for maxima of sums of independent random vectors in high-dimension. In this paper, we reduce the sample size requirement to $n \gg (\log p)^5$ for the consistency of the empirical bootstrap and the multiplier/wild bootstrap in the Kolmogorov-Smirnov distance, possibly in the regime where the Gaussian approximation is not available. New comparison and anti-concentration theorems, which are of considerable interest in and of themselves, are developed as existing ones interweaved with Gaussian approximation are no longer applicable.
• Applying standard statistical methods after model selection may yield inefficient estimators and hypothesis tests that fail to achieve nominal type-I error rates. The main issue is the fact that the post-selection distribution of the data differs from the original distribution. In particular, the observed data is constrained to lie in a subset of the original sample space that is determined by the selected model. This often makes the post-selection likelihood of the observed data intractable and maximum likelihood inference difficult. In this work, we get around the intractable likelihood by generating noisy unbiased estimates of the post-selection score function and using them in a stochastic ascent algorithm that yields correct post-selection maximum likelihood estimates. We apply the proposed technique to the problem of estimating linear models selected by the lasso. In an asymptotic analysis the resulting estimates are shown to be consistent for the selected parameters and to have a limiting truncated normal distribution. Confidence intervals constructed based on the asymptotic distribution obtain close to nominal coverage rates in all simulation settings considered, and the point estimates are shown to be superior to the lasso estimates when the true model is sparse.
• In genetic epidemiological studies, family history data are collected on relatives of study participants and used to estimate the age-specific risk of disease for individuals who carry a causal mutation. However, a family member's genotype data may not be collected due to the high cost of in-person interview to obtain blood sample or death of a relative. Previously, efficient nonparametric genotype-specific risk estimation in censored mixture data has been proposed without considering covariates. With multiple predictive risk factors available, risk estimation requires a multivariate model to account for additional covariates that may affect disease risk simultaneously. Therefore, it is important to consider the role of covariates in the genotype-specific distribution estimation using family history data. We propose an estimation method that permits more precise risk prediction by controlling for individual characteristics and incorporating interaction effects with missing genotypes in relatives, and thus gene-gene interactions and gene-environment interactions can be handled within the framework of a single model. We examine performance of the proposed methods by simulations and apply them to estimate the age-specific cumulative risk of Parkinson's disease (PD) in carriers of LRRK2 G2019S mutation using first-degree relatives who are at genetic risk for PD. The utility of estimated carrier risk is demonstrated through designing a future clinical trial under various assumptions. Such sample size estimation is seen in the Huntington's disease literature using the length of abnormal expansion of a CAG repeat in the HTT gene, but is less common in the PD literature.
• Graphical network inference is used in many fields such as genomics or ecology to infer the conditional independence structure between variables, from measurements of gene expression or species abundances for instance. In many practical cases, it is likely that not all variables involved in the network have been actually observed. The observed samples are therefore drawn from a distribution where some unobserved variables have been marginalized out. In this context, the inference of underlying networks is compromised because marginalization in graphical models yields locally dense structures, which challenges the assumption that biological networks are sparse. In this article we present a procedure for inferring Gaussian graphical models in the presence of unobserved variables. We provide a model based on spanning trees and the EM algorithm which accounts both for the influence of unobserved variables and the low density of the network. We treat the graph structure and the unobserved nodes as latent variables and compute posterior probabilities of edge appearance. We also propose several model selection criteria to estimate the number of missing nodes, and compare our method to existing graph inference techniques on simulations and on a flow cytometry data illustration.
• Performing statistical inference on collections of graphs is of import to many disciplines. Graph embedding, in which the vertices of a graph are mapped to vectors in a low-dimensional Euclidean space, has gained traction as a basic tool for graph analysis. In this paper, we describe an omnibus embedding in which multiple graphs on the same vertex set are jointly embedded into a single space with a distinct representation for each graph. We prove a central limit theorem for this omnibus embedding, and we show that this simultaneous embedding into a common space allows comparison of graphs without the need to perform pairwise alignments of graph embeddings. Experimental results demonstrate that the omnibus embedding improves upon existing methods, allowing better power in multiple-graph hypothesis testing and yielding better estimation in a latent position model.