results for au:Ramdas_A in:stat

- Some pitfalls of the false discovery rate (FDR) as an error criterion for multiple testing of $n$ hypotheses include (a) committing to an error level $q$ in advance limits its use in exploratory data analysis, and (b) controlling the false discovery proportion (FDP) on average provides no guarantee on its variability. We take a step towards overcoming these barriers using a new perspective we call "simultaneous selective inference." Many FDR procedures (such as Benjamini-Hochberg) can be viewed as carving out a $\textit{path}$ of potential rejection sets $\varnothing = \mathcal R_0 \subseteq \mathcal R_1 \subseteq \cdots \subseteq \mathcal R_n \subseteq [n]$, assigning some algorithm-dependent estimate $\widehat{\text{FDP}}(\mathcal R_k)$ to each one. Then, they choose $k^* = \max\{k: \widehat{\text{FDP}}(\mathcal R_k) \leq q\}$. We prove that for all these algorithms, given independent null p-values and a confidence level $\alpha$, either the same $\widehat{FDP}$ or a minor variant thereof bounds the unknown FDP to within a small explicit (algorithm-dependent) constant factor $c_{\text{alg}}(\alpha)$, uniformly across the entire path, with probability $1-\alpha$. Our bounds open up a middle ground between fully simultaneous inference (guarantees for all $2^n$ possible rejection sets), and fully selective inference (guarantees only for $\mathcal R_{k^*}$). They allow the scientist to $\textit{spot}$ one or more suitable rejection sets (Select Post-hoc On the algorithm's Trajectory), by picking data-dependent sizes or error-levels, after examining the entire path of $\widehat{\text{FDP}}(\mathcal R_k)$ and the uniform upper band on $\text{FDP}$. The price for the additional flexibility of spotting is small, for example the multiplier for BH corresponding to 95% confidence is approximately 2.
- In the online false discovery rate (FDR) problem, one observes a possibly infinite sequence of $p$-values $P_1,P_2,\dots$, each testing a different null hypothesis, and an algorithm must pick a sequence of rejection thresholds $\alpha_1,\alpha_2,\dots$ in an online fashion, effectively rejecting the $k$-th null hypothesis whenever $P_k \leq \alpha_k$. Importantly, $\alpha_k$ must be a function of the past, and cannot depend on $P_k$ or any of the later unseen $p$-values, and must be chosen to guarantee that for any time $t$, the FDR up to time $t$ is less than some pre-determined quantity $\alpha \in (0,1)$. In this work, we present a powerful new framework for online FDR control that we refer to as SAFFRON. Like older alpha-investing (AI) algorithms, SAFFRON starts off with an error budget, called alpha-wealth, that it intelligently allocates to different tests over time, earning back some wealth on making a new discovery. However, unlike older methods, SAFFRON's threshold sequence is based on a novel estimate of the alpha fraction that it allocates to true null hypotheses. In the offline setting, algorithms that employ an estimate of the proportion of true nulls are called adaptive methods, and SAFFRON can be seen as an online analogue of the famous offline Storey-BH adaptive procedure. Just as Storey-BH is typically more powerful than the Benjamini-Hochberg (BH) procedure under independence, we demonstrate that SAFFRON is also more powerful than its non-adaptive counterparts, such as LORD and other generalized alpha-investing algorithms. Further, a monotone version of the original AI algorithm is recovered as a special case of SAFFRON, that is often more stable and powerful than the original. Lastly, the derivation of SAFFRON provides a novel template for deriving new online FDR rules.
- Oct 10 2017 stat.ME arXiv:1710.02776v1We propose a general framework based on \textitselectively traversed accumulation rules (STAR) for interactive "human-in-the-loop" multiple testing with generic structural constraints on the rejection set. STAR combines accumulation tests from ordered multiple testing with data-carving ideas from post-selection inference, allowing for highly flexible adaptation to generic structural information. Given independent $p$-values for each of $n$ null hypotheses, STAR defines an iterative protocol for gradually pruning a candidate rejection set, beginning with $\mathcal{R}_0 = [n]$ and shrinking with each step. At step $t$, the analyst estimates the false discovery proportion (FDP) of the current rejection set $\mathcal{R}_t$, and halts and rejects every $H_i$ with $i \in \mathcal{R}_t$ if $\mathrm{FDP}_t \leq \alpha$. Otherwise, the analyst may shrink the rejection set to $\mathcal{R}_{t+1}\subseteq \mathcal{R}_t$ however she wants, provided the choice depends only on partially masked $p$-values $g(p_i)$ for $i\in \mathcal{R}_t$, as well as unmasked $p$-values $p_i$ for $i\notin \mathcal{R}_t$. Typically, the choice will be based on eliminating the "least promising" hypothesis from $\mathcal{R}_t$, after estimating a model from the observable data. By restricting the information available to the analyst, our iterative protocol guarantees exact false discovery rate (FDR) control at level $\alpha$ in finite samples, for any data-adaptive update rule the analyst may choose. We suggest heuristic update rules for a variety of applications with complex structural constraints, show that STAR performs well for problems ranging from convex region detection and bump-hunting to FDR control on trees and DAGs, and show how to extend STAR to regression problems where knockoff statistics are available in lieu of $p$-values.
- In the online multiple testing problem, p-values corresponding to different null hypotheses are observed one by one, and the decision of whether or not to reject the current hypothesis must be made immediately, after which the next p-value is observed. Alpha-investing algorithms to control the false discovery rate (FDR), formulated by Foster and Stine, have been generalized and applied to many settings, including quality-preserving databases in science and multiple A/B or multi-armed bandit tests for internet commerce. This paper improves the class of generalized alpha-investing algorithms (GAI) in four ways: (a) we show how to uniformly improve the power of the entire class of monotone GAI procedures by awarding more alpha-wealth for each rejection, giving a win-win resolution to a recent dilemma raised by Javanmard and Montanari, (b) we demonstrate how to incorporate prior weights to indicate domain knowledge of which hypotheses are likely to be non-null, (c) we allow for differing penalties for false discoveries to indicate that some hypotheses may be more important than others, (d) we define a new quantity called the decaying memory false discovery rate (mem-FDR) that may be more meaningful for truly temporal applications, and which alleviates problems that we describe and refer to as "piggybacking" and "alpha-death". Our GAI++ algorithms incorporate all four generalizations simultaneously, and reduce to more powerful variants of earlier algorithms when the weights and decay are all set to unity. Finally, we also describe a simple method to derive new online FDR rules based on an estimated false discovery proportion.
- We propose a top-down algorithm for multiple testing on directed acyclic graphs (DAGs), where nodes represent hypotheses and edges specify a partial ordering in which hypotheses must be tested. The procedure is guaranteed to reject a sub-DAG with bounded false discovery rate (FDR) while satisfying the logical constraint that a rejected node's parents must also be rejected. It is designed for sequential testing settings, when the DAG structure is known a priori, but the p-values are obtained selectively (such as sequential conduction of experiments), but the algorithm is also applicable in non-sequential settings when all p-values can be calculated in advance (such as variable/model selection). Our DAGGER algorithm, shorthand for Greedily Evolving Rejections on DAGs, allows for independence, positive or arbitrary dependence of the p-values, and is guaranteed to work on two different types of DAGs: (a) intersection DAGs in which all nodes are intersection hypotheses, with parents being supersets of children, or (b) general DAGs in which all nodes may be elementary hypotheses. The DAGGER procedure has the appealing property that it specializes to known algorithms in the special cases of trees and line graphs, and simplifies to the classic Benjamini-Hochberg procedure when the DAG has no edges. We explore the empirical performance of DAGGER using simulations, as well as a real dataset corresponding to a gene ontology DAG, showing that it performs favorably in terms of time and power.
- We propose an alternative framework to existing setups for controlling false alarms when multiple A/B tests are run over time. This setup arises in many practical applications, e.g. when pharmaceutical companies test new treatment options against control pills for different diseases, or when internet companies test their default webpages versus various alternatives over time. Our framework proposes to replace a sequence of A/B tests by a sequence of best-arm MAB instances, which can be continuously monitored by the data scientist. When interleaving the MAB tests with an an online false discovery rate (FDR) algorithm, we can obtain the best of both worlds: low sample complexity and any time online FDR control. Our main contributions are: (i) to propose reasonable definitions of a null hypothesis for MAB instances; (ii) to demonstrate how one can derive an always-valid sequential p-value that allows continuous monitoring of each MAB test; and (iii) to show that using rejection thresholds of online-FDR algorithms as the confidence levels for the MAB algorithms results in both sample-optimality, high power and low FDR at any point in time. We run extensive simulations to verify our claims, and also report results on real data collected from the New Yorker Cartoon Caption contest.
- Multiple hypothesis testing is a central topic in statistics, but despite abundant work on the false discovery rate (FDR) and the corresponding Type-II error concept known as the false non-discovery rate (FNR), a fine-grained understanding of the fundamental limits of multiple testing has not been developed. Our main contribution is to derive a precise non-asymptotic tradeoff between FNR and FDR for a variant of the generalized Gaussian sequence model. Our analysis is flexible enough to permit analyses of settings where the problem parameters vary with the number of hypotheses $n$, including various sparse and dense regimes (with $o(n)$ and $\mathcal{O}(n)$ signals). Moreover, we prove that the Benjamini-Hochberg algorithm as well as the Barber-Candès algorithm are both rate-optimal up to constants across these regimes.
- A significant literature studies ways of employing prior knowledge to improve power and precision of multiple testing procedures. Some common forms of prior knowledge may include (a) a priori beliefs about which hypotheses are null, modeled by non-uniform prior weights; (b) differing importances of hypotheses, modeled by differing penalties for false discoveries; (c) multiple arbitrary partitions of the hypotheses into known (possibly overlapping) groups, indicating (dis)similarity of hypotheses; and (d) knowledge of independence, positive or arbitrary dependence between hypotheses or groups, allowing for more aggressive or conservative procedures. We present a unified algorithmic framework called p-filter for global null testing and false discovery rate (FDR) control that allows the scientist to incorporate all four types of prior knowledge (a)-(d) simultaneously, recovering a wide variety of common algorithms as special cases.
- We propose a method to optimize the representation and distinguishability of samples from two probability distributions, by maximizing the estimated power of a statistical test based on the maximum mean discrepancy (MMD). This optimized MMD is applied to the setting of unsupervised learning by generative adversarial networks (GAN), in which a model attempts to generate realistic samples, and a discriminator attempts to tell these apart from data samples. In this context, the MMD may be used in two roles: first, as a discriminator, either directly on the samples, or on features of the samples. Second, the MMD can be used to evaluate the performance of a generative model, by testing the model's samples against a reference data set. In the latter role, the optimized MMD is particularly helpful, as it gives an interpretable indication of how the model and data distributions differ, even in cases where individual model samples are not easily distinguished either by eye or by classifier.
- Slow mixing is the central hurdle when working with Markov chains, especially those used for Monte Carlo approximations (MCMC). In many applications, it is only of interest to estimate the stationary expectations of a small set of functions, and so the usual definition of mixing based on total variation convergence may be too conservative. Accordingly, we introduce function-specific analogs of mixing times and spectral gaps, and use them to prove Hoeffding-like function-specific concentration inequalities. These results show that it is possible for empirical expectations of functions to concentrate long before the underlying chain has mixed in the classical sense, and we show that the concentration rates we achieve are optimal up to constants. We use our techniques to derive confidence intervals that are sharper than those implied by both classical Markov chain Hoeffding bounds and Berry-Esseen-corrected CLT bounds. For applications that require testing, rather than point estimation, we show similar improvements over recent sequential testing results for MCMC. We conclude by applying our framework to real data examples of MCMC, providing evidence that our theory is both accurate and relevant to practice.
- Permutation-valued features arise in a variety of applications, either in a direct way when preferences are elicited over a collection of items, or an indirect way in which numerical ratings are converted to a ranking. To date, there has been relatively limited study of regression, classification, and testing problems based on permutation-valued features, as opposed to permutation-valued responses. This paper studies the use of reproducing kernel Hilbert space methods for learning from permutation-valued features. These methods embed the rankings into an implicitly defined function space, and allow for efficient estimation of regression and test functions in this richer space. Our first contribution is to characterize both the feature spaces and spectral properties associated with two kernels for rankings, the Kendall and Mallows kernels. Using tools from representation theory, we explain the limited expressive power of the Kendall kernel by characterizing its degenerate spectrum, and in sharp contrast, we prove that Mallows' kernel is universal and characteristic. We also introduce families of polynomial kernels that interpolate between the Kendall (degree one) and Mallows' (infinite degree) kernels. We show the practical effectiveness of our methods via applications to Eurobarometer survey data as well as a Movielens ratings dataset.
- Given a weighted graph with $N$ vertices, consider a real-valued regression problem in a semi-supervised setting, where one observes $n$ labeled vertices, and the task is to label the remaining ones. We present a theoretical study of $\ell_p$-based Laplacian regularization under a $d$-dimensional geometric random graph model. We provide a variational characterization of the performance of this regularized learner as $N$ grows to infinity while $n$ stays constant, the associated optimality conditions lead to a partial differential equation that must be satisfied by the associated function estimate $\hat{f}$. From this formulation we derive several predictions on the limiting behavior the $d$-dimensional function $\hat{f}$, including (a) a phase transition in its smoothness at the threshold $p = d + 1$, and (b) a tradeoff between smoothness and sensitivity to the underlying unlabeled data distribution $P$. Thus, over the range $p \leq d$, the function estimate $\hat{f}$ is degenerate and "spiky," whereas for $p\geq d+1$, the function estimate $\hat{f}$ is smooth. We show that the effect of the underlying density vanishes monotonically with $p$, such that in the limit $p = \infty$, corresponding to the so-called Absolutely Minimal Lipschitz Extension, the estimate $\hat{f}$ is independent of the distribution $P$. Under the assumption of semi-supervised smoothness, ignoring $P$ can lead to poor statistical performance, in particular, we construct a specific example for $d=1$ to demonstrate that $p=2$ has lower risk than $p=\infty$ due to the former penalty adapting to $P$ and the latter ignoring it. We also provide simulations that verify the accuracy of our predictions for finite sample sizes. Together, these properties show that $p = d+1$ is an optimal choice, yielding a function estimate $\hat{f}$ that is both smooth and non-degenerate, while remaining maximally sensitive to $P$.
- When data analysts train a classifier and check if its accuracy is significantly different from random guessing, they are implicitly and indirectly performing a hypothesis test (two sample testing) and it is of importance to ask whether this indirect method for testing is statistically optimal or not. Given that hypothesis tests attempt to maximize statistical power subject to a bound on the allowable false positive rate, while prediction attempts to minimize statistical risk on future predictions on unseen data, we wish to study whether a predictive approach for an ultimate aim of testing is prudent. We formalize this problem by considering the two-sample mean-testing setting where one must determine if the means of two Gaussians (with known and equal covariance) are the same or not, but the analyst indirectly does so by checking whether the accuracy achieved by Fisher's LDA classifier is significantly different from chance or not. Unexpectedly, we find that the asymptotic power of LDA's sample-splitting classification accuracy is actually minimax rate-optimal in terms of problem-dependent parameters. Since prediction is commonly thought to be harder than testing, it might come as a surprise to some that solving a harder problem does not create a information-theoretic bottleneck for the easier one. On the flip side, even though the power is rate-optimal, our derivation suggests that it may be worse by a small constant factor; hence practitioners must be wary of using (admittedly flexible) prediction methods on disguised testing problems.
- Linear independence testing is a fundamental information-theoretic and statistical problem that can be posed as follows: given $n$ points $\{(X_i,Y_i)\}^n_{i=1}$ from a $p+q$ dimensional multivariate distribution where $X_i \in \mathbb{R}^p$ and $Y_i \in\mathbb{R}^q$, determine whether $a^T X$ and $b^T Y$ are uncorrelated for every $a \in \mathbb{R}^p, b\in \mathbb{R}^q$ or not. We give minimax lower bound for this problem (when $p+q,n \to \infty$, $(p+q)/n \leq \kappa < \infty$, without sparsity assumptions). In summary, our results imply that $n$ must be at least as large as $\sqrt {pq}/\|\Sigma_{XY}\|_F^2$ for any procedure (test) to have non-trivial power, where $\Sigma_{XY}$ is the cross-covariance matrix of $X,Y$. We also provide some evidence that the lower bound is tight, by connections to two-sample testing and regression in specific settings.
- In many practical applications of multiple hypothesis testing using the False Discovery Rate (FDR), the given hypotheses can be naturally partitioned into groups, and one may not only want to control the number of false discoveries (wrongly rejected null hypotheses), but also the number of falsely discovered groups of hypotheses (we say a group is falsely discovered if at least one hypothesis within that group is rejected, when in reality the group contains only nulls). In this paper, we introduce the p-filter, a procedure which unifies and generalizes the standard FDR procedure by Benjamini and Hochberg and global null testing procedure by Simes. We first prove that our proposed method can simultaneously control the overall FDR at the finest level (individual hypotheses treated separately) and the group FDR at coarser levels (when such groups are user-specified). We then generalize the p-filter procedure even further to handle multiple partitions of hypotheses, since that might be natural in many applications. For example, in neuroscience experiments, we may have a hypothesis for every (discretized) location in the brain, and at every (discretized) timepoint: does the stimulus correlate with activity in location x at time t after the stimulus was presented? In this setting, one might want to group hypotheses by location and by time. Importantly, our procedure can handle multiple partitions which are nonhierarchical (i.e. one partition may arrange p-values by voxel, and another partition arranges them by time point; neither one is nested inside the other). We prove that our procedure controls FDR simultaneously across these multiple lay- ers, under assumptions that are standard in the literature: we do not need the hypotheses to be independent, but require a nonnegative dependence condition known as PRDS.
- Nonparametric two sample or homogeneity testing is a decision theoretic problem that involves identifying differences between two random variables without making parametric assumptions about their underlying distributions. The literature is old and rich, with a wide variety of statistics having being intelligently designed and analyzed, both for the unidimensional and the multivariate setting. Our contribution is to tie together many of these tests, drawing connections between seemingly very different statistics. In this work, our central object is the Wasserstein distance, as we form a chain of connections from univariate methods like the Kolmogorov-Smirnov test, PP/QQ plots and ROC/ODC curves, to multivariate tests involving energy statistics and kernel based maximum mean discrepancy. Some connections proceed through the construction of a \textitsmoothed Wasserstein distance, and others through the pursuit of a "distribution-free" Wasserstein test. Some observations in this chain are implicit in the literature, while others seem to have not been noticed thus far. Given nonparametric two sample testing's classical and continued importance, we aim to provide useful connections for theorists and practitioners familiar with one subset of methods but not others.
- Nonparametric two sample testing is a decision theoretic problem that involves identifying differences between two random variables without making parametric assumptions about their underlying distributions. We refer to the most common settings as mean difference alternatives (MDA), for testing differences only in first moments, and general difference alternatives (GDA), which is about testing for any difference in distributions. A large number of test statistics have been proposed for both these settings. This paper connects three classes of statistics - high dimensional variants of Hotelling's t-test, statistics based on Reproducing Kernel Hilbert Spaces, and energy statistics based on pairwise distances. We ask the question: how much statistical power do popular kernel and distance based tests for GDA have when the unknown distributions differ in their means, compared to specialized tests for MDA? We formally characterize the power of popular tests for GDA like the Maximum Mean Discrepancy with the Gaussian kernel (gMMD) and bandwidth-dependent variants of the Energy Distance with the Euclidean norm (eED) in the high-dimensional MDA regime. Some practically important properties include (a) eED and gMMD have asymptotically equal power; furthermore they enjoy a free lunch because, while they are additionally consistent for GDA, they also have the same power as specialized high-dimensional t-test variants for MDA. All these tests are asymptotically optimal (including matching constants) under MDA for spherical covariances, according to simple lower bounds, (b) The power of gMMD is independent of the kernel bandwidth, as long as it is larger than the choice made by the median heuristic, (c) There is a clear and smooth computation-statistics tradeoff for linear-time, subquadratic-time and quadratic-time versions of these tests, with more computation resulting in higher power.
- Jun 16 2015 stat.ML arXiv:1506.04725v1We propose a class of nonparametric two-sample tests with a cost linear in the sample size. Two tests are given, both based on an ensemble of distances between analytic functions representing each of the distributions. The first test uses smoothed empirical characteristic functions to represent the distributions, the second uses distribution embeddings in a reproducing kernel Hilbert space. Analyticity implies that differences in the distributions may be detected almost surely at a finite number of randomly chosen locations/frequencies. The new tests are consistent against a larger class of alternatives than the previous linear-time tests based on the (non-smoothed) empirical characteristic functions, while being much faster than the current state-of-the-art quadratic-time kernel-based or energy distance-based tests. Experiments on artificial benchmarks and on challenging real-world testing problems demonstrate that our tests give a better power/time tradeoff than competing approaches, and in some cases, better outright power than even the most expensive quadratic-time tests. This performance advantage is retained even in high dimensions, and in cases where the difference in distributions is not observable with low order statistics.
- We propose a new algorithmic framework for sequential hypothesis testing with i.i.d. data, which includes A/B testing, nonparametric two-sample testing, and independence testing as special cases. It is novel in several ways: (a) it takes linear time and constant space to compute on the fly, (b) it has the same power guarantee as a non-sequential version of the test with the same computational constraints up to a small factor, and (c) it accesses only as many samples as are required - its stopping time adapts to the unknown difficulty of the problem. All our test statistics are constructed to be zero-mean martingales under the null hypothesis, and the rejection threshold is governed by a uniform non-asymptotic law of the iterated logarithm (LIL). For the case of nonparametric two-sample mean testing, we also provide a finite sample power analysis, and the first non-asymptotic stopping time calculations for this class of problems. We verify our predictions for type I and II errors and stopping times using simulations.
- In active learning, the user sequentially chooses values for feature $X$ and an oracle returns the corresponding label $Y$. In this paper, we consider the effect of feature noise in active learning, which could arise either because $X$ itself is being measured, or it is corrupted in transmission to the oracle, or the oracle returns the label of a noisy version of the query point. In statistics, feature noise is known as "errors in variables" and has been studied extensively in non-active settings. However, the effect of feature noise in active learning has not been studied before. We consider the well-known Berkson errors-in-variables model with additive uniform noise of width $\sigma$. Our simple but revealing setting is that of one-dimensional binary classification setting where the goal is to learn a threshold (point where the probability of a $+$ label crosses half). We deal with regression functions that are antisymmetric in a region of size $\sigma$ around the threshold and also satisfy Tsybakov's margin condition around the threshold. We prove minimax lower and upper bounds which demonstrate that when $\sigma$ is smaller than the minimiax active/passive noiseless error derived in \citeCN07, then noise has no effect on the rates and one achieves the same noiseless rates. For larger $\sigma$, the \textitunflattening of the regression function on convolution with uniform noise, along with its local antisymmetry around the threshold, together yield a behaviour where noise \textitappears to be beneficial. Our key result is that active learning can buy significant improvement over a passive strategy even in the presence of feature noise.
- Interesting theoretical associations have been established by recent papers between the fields of active learning and stochastic convex optimization due to the common role of feedback in sequential querying mechanisms. In this paper, we continue this thread in two parts by exploiting these relations for the first time to yield novel algorithms in both fields, further motivating the study of their intersection. First, inspired by a recent optimization algorithm that was adaptive to unknown uniform convexity parameters, we present a new active learning algorithm for one-dimensional thresholds that can yield minimax rates by adapting to unknown noise parameters. Next, we show that one can perform $d$-dimensional stochastic minimization of smooth uniformly convex functions when only granted oracle access to noisy gradient signs along any coordinate instead of real-valued gradients, by using a simple randomized coordinate descent procedure where each line search can be solved by $1$-dimensional active learning, provably achieving the same error convergence rate as having the entire real-valued gradient. Combining these two parts yields an algorithm that solves stochastic convex optimization of uniformly convex and smooth functions using only noisy gradient signs by repeatedly performing active learning, achieves optimal rates and is adaptive to all unknown convexity and smoothness parameters.
- Nonparametric two sample testing deals with the question of consistently deciding if two distributions are different, given samples from both, without making any parametric assumptions about the form of the distributions. The current literature is split into two kinds of tests - those which are consistent without any assumptions about how the distributions may differ (\textitgeneral alternatives), and those which are designed to specifically test easier alternatives, like a difference in means (\textitmean-shift alternatives). The main contribution of this paper is to explicitly characterize the power of a popular nonparametric two sample test, designed for general alternatives, under a mean-shift alternative in the high-dimensional setting. Specifically, we explicitly derive the power of the linear-time Maximum Mean Discrepancy statistic using the Gaussian kernel, where the dimension and sample size can both tend to infinity at any rate, and the two distributions differ in their means. As a corollary, we find that if the signal-to-noise ratio is held constant, then the test's power goes to one if the number of samples increases faster than the dimension increases. This is the first explicit power derivation for a general nonparametric test in the high-dimensional setting, and also the first analysis of how tests designed for general alternatives perform when faced with easier ones.
- Given a matrix $A$, a linear feasibility problem (of which linear classification is a special case) aims to find a solution to a primal problem $w: A^Tw > \textbf{0}$ or a certificate for the dual problem which is a probability distribution $p: Ap = \textbf{0}$. Inspired by the continued importance of "large-margin classifiers" in machine learning, this paper studies a condition measure of $A$ called its \textitmargin that determines the difficulty of both the above problems. To aid geometrical intuition, we first establish new characterizations of the margin in terms of relevant balls, cones and hulls. Our second contribution is analytical, where we present generalizations of Gordan's theorem, and variants of Hoffman's theorems, both using margins. We end by proving some new results on a classical iterative scheme, the Perceptron, whose convergence rates famously depends on the margin. Our results are relevant for a deeper understanding of margin-based learning and proving convergence rates of iterative schemes, apart from providing a unifying perspective on this vast topic.
- This paper is about randomized iterative algorithms for solving a linear system of equations $X \beta = y$ in different settings. Recent interest in the topic was reignited when Strohmer and Vershynin (2009) proved the linear convergence rate of a Randomized Kaczmarz (RK) algorithm that works on the rows of $X$ (data points). Following that, Leventhal and Lewis (2010) proved the linear convergence of a Randomized Coordinate Descent (RCD) algorithm that works on the columns of $X$ (features). The aim of this paper is to simplify our understanding of these two algorithms, establish the direct relationships between them (though RK is often compared to Stochastic Gradient Descent), and examine the algorithmic commonalities or tradeoffs involved with working on rows or columns. We also discuss Kernel Ridge Regression and present a Kaczmarz-style algorithm that works on data points and having the advantage of solving the problem without ever storing or forming the Gram matrix, one of the recognized problems encountered when scaling kernelized methods.
- This paper is about two related decision theoretic problems, nonparametric two-sample testing and independence testing. There is a belief that two recently proposed solutions, based on kernels and distances between pairs of points, behave well in high-dimensional settings. We identify different sources of misconception that give rise to the above belief. Specifically, we differentiate the hardness of estimation of test statistics from the hardness of testing whether these statistics are zero or not, and explicitly discuss a notion of "fair" alternative hypotheses for these problems as dimension increases. We then demonstrate that the power of these tests actually drops polynomially with increasing dimension against fair alternatives. We end with some theoretical insights and shed light on the \textitmedian heuristic for kernel bandwidth selection. Our work advances the current understanding of the power of modern nonparametric hypothesis tests in high dimensions.
- Jun 10 2014 stat.ML arXiv:1406.1922v2This paper deals with the problem of nonparametric independence testing, a fundamental decision-theoretic problem that asks if two arbitrary (possibly multivariate) random variables $X,Y$ are independent or not, a question that comes up in many fields like causality and neuroscience. While quantities like correlation of $X,Y$ only test for (univariate) linear independence, natural alternatives like mutual information of $X,Y$ are hard to estimate due to a serious curse of dimensionality. A recent approach, avoiding both issues, estimates norms of an \textitoperator in Reproducing Kernel Hilbert Spaces (RKHSs). Our main contribution is strong empirical evidence that by employing \textitshrunk operators when the sample size is small, one can attain an improvement in power at low false positive rates. We analyze the effects of Stein shrinkage on a popular test statistic called HSIC (Hilbert-Schmidt Independence Criterion). Our observations provide insights into two recently proposed shrinkage estimators, SCOSE and FCOSE - we prove that SCOSE is (essentially) the optimal linear shrinkage method for \textitestimating the true operator; however, the non-linearly shrunk FCOSE usually achieves greater improvements in \textittest power. This work is important for more powerful nonparametric detection of subtle nonlinear dependencies for small samples.
- This paper presents a fast and robust algorithm for trend filtering, a recently developed nonparametric regression tool. It has been shown that, for estimating functions whose derivatives are of bounded variation, trend filtering achieves the minimax optimal error rate, while other popular methods like smoothing splines and kernels do not. Standing in the way of a more widespread practical adoption, however, is a lack of scalable and numerically stable algorithms for fitting trend filtering estimates. This paper presents a highly efficient, specialized ADMM routine for trend filtering. Our algorithm is competitive with the specialized interior point methods that are currently in use, and yet is far more numerically robust. Furthermore, the proposed ADMM implementation is very simple, and importantly, it is flexible enough to extend to many interesting related problems, such as sparse trend filtering and isotonic trend filtering. Software for our method is freely available, in both the C and R languages.
- Jan 28 2014 stat.AP arXiv:1401.6595v3Functional neuroimaging measures how the brain responds to complex stimuli. However, sample sizes are modest, noise is substantial, and stimuli are high dimensional. Hence, direct estimates are inherently imprecise and call for regularization. We compare a suite of approaches which regularize via shrinkage: ridge regression, the elastic net (a generalization of ridge regression and the lasso), and a hierarchical Bayesian model based on small area estimation (SAE). We contrast regularization with spatial smoothing and combinations of smoothing and shrinkage. All methods are tested on functional magnetic resonance imaging (fMRI) data from multiple subjects participating in two different experiments related to reading, for both predicting neural response to stimuli and decoding stimuli from responses. Interestingly, when the regularization parameters are chosen by cross-validation independently for every voxel, low/high regularization is chosen in voxels where the classification accuracy is high/low, indicating that the regularization intensity is a good tool for identification of relevant voxels for the cognitive task. Surprisingly, all the regularization methods work about equally well, suggesting that beating basic smoothing and shrinkage will take not only clever methods, but also careful modeling.
- We focus on the problem of minimizing a convex function $f$ over a convex set $S$ given $T$ queries to a stochastic first order oracle. We argue that the complexity of convex minimization is only determined by the rate of growth of the function around its minimizer $x^*_{f,S}$, as quantified by a Tsybakov-like noise condition. Specifically, we prove that if $f$ grows at least as fast as $\|x-x^*_{f,S}\|^\kappa$ around its minimum, for some $\kappa > 1$, then the optimal rate of learning $f(x^*_{f,S})$ is $\Theta(T^{-\frac{\kappa}{2\kappa-2}})$. The classic rate $\Theta(1/\sqrt T)$ for convex functions and $\Theta(1/T)$ for strongly convex functions are special cases of our result for $\kappa \rightarrow \infty$ and $\kappa=2$, and even faster rates are attained for $\kappa <2$. We also derive tight bounds for the complexity of learning $x_{f,S}^*$, where the optimal rate is $\Theta(T^{-\frac{1}{2\kappa-2}})$. Interestingly, these precise rates for convex optimization also characterize the complexity of active learning and our results further strengthen the connections between the two fields, both of which rely on feedback-driven queries.