- May 29 2017 stat.ME arXiv:1705.09599v1As a competitive alternative to least squares regression, quantile regression is popular in analyzing heterogenous data. For quantile regression model specified for one single quantile level $\tau$, major difficulties of semiparametric efficient estimation are the unavailability of a parametric efficient score and the conditional density estimation. In this paper, with the help of the least favorable submodel technique, we first derive the semiparametric efficient scores for linear quantile regression models that are assumed for a single quantile level, multiple quantile levels and all the quantile levels in $(0,1)$ respectively. Our main discovery is a one-step (nearly) semiparametric efficient estimation for the regression coefficients of the quantile regression models assumed for multiple quantile levels, which has several advantages: it could be regarded as an optimal way to pool information across multiple/other quantiles for efficiency gain; it is computationally feasible and easy to implement, as the initial estimator is easily available; due to the nature of quantile regression models under investigation, the conditional density estimation is straightforward by plugging in an initial estimator. The resulting estimator is proved to achieve the corresponding semiparametric efficiency lower bound under regularity conditions. Numerical studies including simulations and an example of birth weight of children confirms that the proposed estimator leads to higher efficiency compared with the Koenker-Bassett quantile regression estimator for all quantiles of interest.
- May 29 2017 stat.ME arXiv:1705.09561v1Many data producers seek to provide users access to confidential data without unduly compromising data subjects' privacy and confidentiality. When intense redaction is needed to do so, one general strategy is to require users to do analyses without seeing the confidential data, for example, by releasing fully synthetic data or by allowing users to query remote systems for disclosure-protected outputs of statistical models. With fully synthetic data or redacted outputs, the analyst never really knows how much to trust the resulting findings. In particular, if the user did the same analysis on the confidential data, would regression coefficients of interest be statistically significant or not? We present algorithms for assessing this question that satisfy differential privacy. We describe conditions under which the algorithms should give accurate answers about statistical significance. We illustrate the properties of the methods using artificial and genuine data.
- May 29 2017 stat.ME arXiv:1705.09528v1The Bonferroni adjustment, or the union bound, is commonly used to study rate optimality properties of statistical methods in high-dimensional problems. However, in practice, the Bonferroni adjustment is overly conservative. The extreme value theory has been proven to provide more accurate multiplicity adjustments in a number of settings, but only on ad hoc basis. Recently, Gaussian approximation has been used to justify bootstrap adjustments in large scale simultaneous inference in some general settings when $n \gg (\log p)^7$, where $p$ is the multiplicity of the inference problem and $n$ is the sample size. The thrust of this theory is the validity of the Gaussian approximation for maxima of sums of independent random vectors in high-dimension. In this paper, we reduce the sample size requirement to $n \gg (\log p)^5$ for the consistency of the empirical bootstrap and the multiplier/wild bootstrap in the Kolmogorov-Smirnov distance, possibly in the regime where the Gaussian approximation is not available. New comparison and anti-concentration theorems, which are of considerable interest in and of themselves, are developed as existing ones interweaved with Gaussian approximation are no longer applicable.
- May 29 2017 stat.ME arXiv:1705.09417v1Applying standard statistical methods after model selection may yield inefficient estimators and hypothesis tests that fail to achieve nominal type-I error rates. The main issue is the fact that the post-selection distribution of the data differs from the original distribution. In particular, the observed data is constrained to lie in a subset of the original sample space that is determined by the selected model. This often makes the post-selection likelihood of the observed data intractable and maximum likelihood inference difficult. In this work, we get around the intractable likelihood by generating noisy unbiased estimates of the post-selection score function and using them in a stochastic ascent algorithm that yields correct post-selection maximum likelihood estimates. We apply the proposed technique to the problem of estimating linear models selected by the lasso. In an asymptotic analysis the resulting estimates are shown to be consistent for the selected parameters and to have a limiting truncated normal distribution. Confidence intervals constructed based on the asymptotic distribution obtain close to nominal coverage rates in all simulation settings considered, and the point estimates are shown to be superior to the lasso estimates when the true model is sparse.
- May 29 2017 stat.ME arXiv:1705.09591v1
- May 29 2017 stat.ME arXiv:1705.09464v1
- May 29 2017 stat.ME arXiv:1705.09355v1