We establish a tight characterization of the worst-case rates for the excess risk of agnostic learning with sample compression schemes and for uniform convergence for agnostic sample compression schemes. In particular, we find that the optimal rates of convergence for size-$k$ agnostic sample compression schemes are of the form $\sqrt{\frac{k \log(n/k)}{n}}$, which contrasts with agnostic learning with classes of VC dimension $k$, where the optimal rates are of the form $\sqrt{\frac{k}{n}}$.

When confronted with massive data streams, summarizing data with dimension reduction methods such as PCA raises theoretical and algorithmic pitfalls. Principal curves act as a nonlinear generalization of PCA and the present paper proposes a novel algorithm to automatically and sequentially learn principal curves from data streams. We show that our procedure is supported by regret bounds with optimal sublinear remainder terms. A greedy local search implementation that incorporates both sleeping experts and multi-armed bandit ingredients is presented, along with its regret bound and performance on a toy example and seismic data.

Motivated by the construction of robust estimators using the convex relaxation paradigm, known to be computationally efficient, we present some conditions on the sample size which guarantee an augmented notion of Restricted Eigenvalue-type condition for Gaussian designs. Such notion is suitable for the construction of robust estimators of a multivariate Gaussian model whose samples are corrupted by outliers.

Instrumental variable is an essential tool for addressing unmeasured confounding in observational studies. Two stage predictor substitution (2SPS) estimator and two stage residual inclusion(2SRI) are two commonly used approaches in applying instrumental variables. Recently 2SPS was studied under the Aalen additive hazards model in the presence of competing risks of time-to-events data, where linearity was assumed for the relation between the treatment and the instrument variable. This assumption may not be the most appropriate when we have binary treatments. In this paper, we consider the 2SRI estimator under the Aalen additive hazards model for general survival data and in the presence of competing risks, which allows generalized linear models for the relation between the treatment and the instrumental variable. We derive the asymptotic properties including a closed-form asymptotic variance estimate for the 2SRI estimator. We carry out numerical studies in finite samples, and apply our methodology to the linked Surveillance, Epidemiology and End Results (SEER) - Medicare database comparing radical prostatectomy versus conservative treatment in early-stage prostate cancer patients.

We develop large sample theory for merged data from multiple sources. Main statistical issues treated in this paper are (1) the same unit potentially appears in multiple datasets from overlapping data sources, (2) duplicated items are not identified, and (3) a sample from the same data source is dependent due to sampling without replacement. We propose and study a new weighted empirical process and extend empirical process theory to a dependent and biased sample with duplication. Specifically, we establish the uniform law of large numbers and uniform central limit theorem over a class of functions along with several empirical process results under conditions identical to those in the i.i.d. setting. As applications, we study infinite-dimensional M-estimation and develop its consistency, rates of convergence, and asymptotic normality. Our theoretical results are illustrated with simulation studies and a real data example.

We study uniqueness in the generalized lasso problem, where the penalty is the $\ell_1$ norm of a matrix $D$ times the coefficient vector. We derive a broad result on uniqueness that places weak assumptions on the predictor matrix $X$ and penalty matrix $D$; the implication is that, if $D$ is fixed and its null space is not too large (the dimension of its null space is at most the number of samples), and $X$ and response vector $y$ jointly follow an absolutely continuous distribution, then the generalized lasso problem has a unique solution almost surely, regardless of the number of predictors relative to the number of samples. This effectively generalizes previous uniqueness results for the lasso problem (which corresponds to the special case $D=I$). Further, we extend our study to the case in which the loss is given by the negative log-likelihood from a generalized linear model. In addition to uniqueness results, we derive results on the local stability of generalized lasso solutions that might be of interest in their own right.

We study high-dimensional M-estimators with the trimmed $\ell_1$ penalty. While standard $\ell_1$ penalty incurs bias (shrinkage), trimmed $\ell_1$ leaves the $h$ largest entries penalty-free. This family of estimators include the Trimmed Lasso for sparse linear regression and its counterpart for sparse graphical model estimation. The trimmed $\ell_1$ penalty is non-convex, but unlike other non-convex regularizers such as SCAD and MCP, it is not amenable and therefore prior analyzes cannot be applied. We characterize the support recovery of the estimates as a function of the trimming parameter $h$. Under certain conditions, we show that for any local optimum, (i) if the trimming parameter $h$ is smaller than the true support size, all zero entries of the true parameter vector are successfully estimated as zero, and (ii) if $h$ is larger than the true support size, the non-relevant parameters of the local optimum have smaller absolute values than relevant parameters and hence relevant parameters are not penalized. We then bound the $\ell_2$ error of any local optimum. These bounds are asymptotically comparable to those for non-convex amenable penalties such as SCAD or MCP, but enjoy better constants. We specialize our main results to linear regression and graphical model estimation. Finally, we develop a fast provably convergent optimization algorithm for the trimmed regularizer problem. The algorithm has the same rate of convergence as difference of convex (DC)-based approaches, but is faster in practice and finds better objective values than recently proposed algorithms for DC optimization. Empirical results further demonstrate the value of $\ell_1$ trimming.