# Methodology (stat.ME)

• Principal manifolds are used to represent high-dimensional data in a low-dimensional space. They are high-dimensional generalizations of principal curves and surfaces. The existing methods for fitting principal manifolds have several shortcomings: model bias, heavy computational burden, sensitivity to outliers, and difficulty of use in applications. We propose a novel method for modeling principal manifolds that addresses these limitations. It is based on minimization of penalized mean squared error functionals, providing a nonlinear summary of the data points in Euclidean spaces. We introduce the framework in the context of principal manifolds of middles and develop an estimate by proposing a high-dimensional mixture density estimation procedure. The Sobolev embedding theorem guarantees the regularity of the derived manifolds and analytical expressions of the embedding maps are obtained. The algorithm is computationally efficient and robust to outliers. We used simulation studies to illustrate the comparative performance of the proposed method in low-dimensions and found that it performs better than competitors. In addition, we analyze computed tomography images of lung cancer tumors focusing on two important clinical questions - estimation of the tumor surface and identification of tumor interior classifier. We used the obtained analytic expressions of embedding maps to construct a tumor interior classifier.
• Assuming stationarity is unrealistic in many time series applications. A more realistic alternative is to allow for piecewise stationarity, where the model is allowed to change at given time points. We propose a three-stage procedure for consistent estimation of both structural change points and parameters of high-dimensional piecewise vector autoregressive (VAR) models. In the first step, we reformulate the change point detection problem as a high-dimensional variable selection one, and propose a penalized least square estimator using a total variation penalty. We show that the proposed penalized estimation method over-estimates the number of change points. We then propose a backward selection criterion in conjunction with a penalized least square estimator to tackle this issue. In the last step of our procedure, we estimate the VAR parameters in each of the segments. We prove that the proposed procedure consistently detects the number of change points and their locations. We also show that the procedure consistently estimates the VAR parameters. The performance of the method is illustrated through several simulation scenarios and real data examples.
• Many popular random partition models, such as the Chinese restaurant process and its two-parameter extension, fall in the class of exchangeable random partitions, and have found wide applicability in model-based clustering, population genetics, ecology or network analysis. While the exchangeability assumption is sensible in many cases, it has some strong implications. In particular, Kingman's representation theorem implies that the size of the clusters necessarily grows linearly with the sample size; this feature may be undesirable for some applications, as recently pointed out by Miller et al. (2015). We present here a flexible class of non-exchangeable random partition models which are able to generate partitions whose cluster sizes grow sublinearly with the sample size, and where the growth rate is controlled by one parameter. Along with this result, we provide the asymptotic behaviour of the number of clusters of a given size, and show that the model can exhibit a power-law behavior, controlled by another parameter. The construction is based on completely random measures and a Poisson embedding of the random partition, and inference is performed using a Sequential Monte Carlo algorithm. Additionally, we show how the model can also be directly used to generate sparse multigraphs with power-law degree distributions and degree sequences with sublinear growth. Finally, experiments on real datasets emphasize the usefulness of the approach compared to a two-parameter Chinese restaurant process.
• Nov 21 2017 stat.ME arXiv:1711.07137v1
Use of nonparametric techniques (e.g., machine learning, kernel smoothing, stacking) are increasingly appealing because they do not require precise knowledge of the true underlying models that generated the data under study. Indeed, numerous authors have advocated for their use with standard methods (e.g., regression, inverse probability weighting) in epidemiology. However, when used in the context of such singly robust approaches, nonparametric methods can lead to suboptimal statistical properties, including inefficiency and no valid confidence intervals. Using extensive Monte Carlo simulations, we show how doubly robust methods offer improvements over singly robust approaches when implemented via nonparametric methods. We use 10,000 simulated samples and 50, 100, 200, 600, and 1200 observations to investigate the bias and mean squared error of singly robust (g Computation, inverse probability weighting) and doubly robust (augmented inverse probability weighting, targeted maximum likelihood estimation) estimators under four scenarios: correct and incorrect model specification; and parametric and nonparametric estimation. As expected, results show best performance with g computation under correctly specified parametric models. However, even when based on complex transformed covariates, double robust estimation performs better than singly robust estimators when nonparametric methods are used. Our results suggest that nonparametric methods should be used with doubly instead of singly robust estimation techniques.
• Quadratic approximations of logistic log-likelihoods are fundamental to facilitate estimation and inference for binary variables. Although classical expansions underlying Newton-Raphson and Fisher scoring methods have attracted much of the interest, there has been also a recent focus on quadratic bounds that uniformly minorize the logistic log-likelihood, and are tangent to it in a specific point. Compared to the classical Taylor expansion of the score function, these approximations provide iterative estimation procedures which guarantee monotonicity in the log-likelihood sequence, and motivate variational methods for Bayesian inference. A relevant contribution, within this class of approximations, relies on a convex duality argument to derive a tractable family of tangent quadratic expansions indexed by a location parameter. Although this approximation is widely used in practice, less attempts have been made to understand its probabilistic justification and the associated properties. To address this gap, we formally relate this quadratic lower bound to a recent Pólya-gamma data augmentation, showing that the approximation error associated with the bound coincides with the Kullback-Leibler divergence between a generic Pólya-gamma variable and the one obtained by conditioning on the observed response data. This result facilitates the study of the optimality properties associated with the minorize-majorize and variational Bayes routines leveraging this quadratic bound, and motivates a novel mean-field variational Bayes for logistic regression.
• For a general class of priors based on random series basis expansion, we develop the Bayes Lepski's method to estimate unknown regression function. In this approach, the series truncation point is determined based on a stopping rule that balances the posterior mean bias and the posterior standard deviation. Equipped with this mechanism, we present a method to construct adaptive Bayesian credible bands, where this statistical task is reformulated into a problem in geometry, and the band's radius is computed based on finding the volume of certain tubular neighborhood embedded on a unit sphere. We consider two special cases involving B-splines and wavelets, and discuss some interesting consequences such as the uncertainty principle and self-similarity. Lastly, we show how to program the Bayes Lepski stopping rule on a computer, and numerical simulations in conjunction with our theoretical investigations concur that this is a promising Bayesian uncertainty quantification procedure.
• We propose an optimal sequential methodology for obtaining confidence intervals for a binomial proportion $\theta$. Assuming that an i.i.d. random sequence of Benoulli($\theta$) trials is observed sequentially, we are interested in designing a)~a stopping time $T$ that will decide when is the best time to stop sampling the process, and b)~an optimum estimator $\hat{\theta}_{T}$ that will provide the optimum center of the interval estimate of $\theta$. We follow a semi-Bayesian approach, where we assume that there exists a prior distribution for $\theta$, and our goal is to minimize the average number of samples while we guarantee a minimal coverage probability level. The solution is obtained by applying standard optimal stopping theory and computing the optimum pair $(T,\hat{\theta}_{T})$ numerically. Regarding the optimum stopping time component $T$, we demonstrate that it enjoys certain very uncommon characteristics not encountered in solutions of other classical optimal stopping problems. Finally, we compare our method with the optimum fixed-sample-size procedure but also with existing alternative sequential schemes.
• We revisit the classification problem and focus on nonlinear methods for classification on manifolds. For multivariate datasets lying on an embedded nonlinear Riemannian manifold within the higher-dimensional space, our aim is to acquire a classification boundary between the classes with labels. Motivated by the principal flow [Panaretos, Pham and Yao, 2014], a curve that moves along a path of the maximum variation of the data, we introduce the principal boundary. From the classification perspective, the principal boundary is defined as an optimal curve that moves in between the principal flows traced out from two classes of the data, and at any point on the boundary, it maximizes the margin between the two classes. We estimate the boundary in quality with its direction supervised by the two principal flows. We show that the principal boundary yields the usual decision boundary found by the support vector machine, in the sense that locally, the two boundaries coincide. By means of examples, we illustrate how to find, use and interpret the principal boundary.