We propose a family of near-metrics based on local graph diffusion to capture similarity for a wide class of data sets. These quasi-metametrics, as their names suggest, dispense with one or two standard axioms of metric spaces, specifically distinguishability and symmetry, so that similarity between data points of arbitrary type and form could be measured broadly and effectively. The proposed near-metric family includes the forward k-step diffusion and its reverse, typically on the graph consisting of data objects and their features. By construction, this family of near-metrics is particularly appropriate for categorical data, continuous data, and vector representations of images and text extracted via deep learning approaches. We conduct extensive experiments to evaluate the performance of this family of similarity measures and compare and contrast with traditional measures of similarity used for each specific application and with the ground truth when available. We show that for structured data including categorical and continuous data, the near-metrics corresponding to normalized forward k-step diffusion (k small) work as one of the best performing similarity measures; for vector representations of text and images including those extracted from deep learning, the near-metrics derived from normalized and reverse k-step graph diffusion (k very small) exhibit outstanding ability to distinguish data points from different classes.
Recently, there has been substantial interest in clustering research that takes a beyond worst-case approach to the analysis of algorithms. The typical idea is to design a clustering algorithm that outputs a near-optimal solution, provided the data satisfy a natural stability notion. For example, Bilu and Linial (2010) and Awasthi et al. (2012) presented algorithms that output near-optimal solutions, assuming the optimal solution is preserved under small perturbations to the input distances. A drawback to this approach is that the algorithms are often explicitly built according to the stability assumption and give no guarantees in the worst case; indeed, several recent algorithms output arbitrarily bad solutions even when just a small section of the data does not satisfy the given stability notion. In this work, we address this concern in two ways. First, we provide algorithms that inherit the worst-case guarantees of clustering approximation algorithms, while simultaneously guaranteeing near-optimal solutions when the data is stable. Our algorithms are natural modifications to existing state-of-the-art approximation algorithms. Second, we initiate the study of local stability, which is a property of a single optimal cluster rather than an entire optimal solution. We show our algorithms output all optimal clusters which satisfy stability locally. Specifically, we achieve strong positive results in our local framework under recent stability notions including metric perturbation resilience (Angelidakis et al. 2017) and robust perturbation resilience (Balcan and Liang 2012) for the $k$-median, $k$-means, and symmetric/asymmetric $k$-center objectives.
As datasets become larger and more distributed, algorithms for distributed clustering have become more and more important. In this work, we present a general framework for designing distributed clustering algorithms that are robust to outliers. Using our framework, we give a distributed approximation algorithm for k-means, k-median, or generally any L_p objective, with z outliers and/or balance constraints, using O(m(k+z)(d+log n)) bits of communication, where m is the number of machines, n is the size of the point set, and d is the dimension. This generalizes and improves over previous work of Bateni et al. and Malkomes et al. As a special case, we achieve the first distributed algorithm for k-median with outliers, answering an open question posed by Malkomes et al. For distributed k-means clustering, we provide the first dimension-dependent communication complexity lower bound for finding the optimal clustering. This improves over the lower bound from Chen et al. which is dimension-agnostic. Furthermore, we give distributed clustering algorithms which return nearly optimal solutions, provided the data satisfies the approximation stability condition of Balcan et al. or the spectral stability condition of Kumar and Kannan.
Max-cut, clustering, and many other partitioning problems that are of significant importance to machine learning and other scientific fields are NP-hard, a reality that has motivated researchers to develop a wealth of approximation algorithms and heuristics. Although the best algorithm to use typically depends on the specific application domain, a worst-case analysis is often used to compare algorithms. This may be misleading if worst-case instances occur infrequently, and thus there is a demand for optimization methods which return the algorithm configuration best suited for the given application's typical inputs. We address this problem for clustering, max-cut, and other partitioning problems, such as integer quadratic programming, by designing computationally efficient and sample efficient learning algorithms which receive samples from an application-specific distribution over problem instances and learn a partitioning algorithm with high expected performance. Our algorithms learn over common integer quadratic programming and clustering algorithm families: SDP rounding algorithms and agglomerative clustering algorithms with dynamic programming. For our sample complexity analysis, we provide tight bounds on the pseudodimension of these algorithm classes, and show that surprisingly, even for classes of algorithms parameterized by a single parameter, the pseudo-dimension is superconstant. In this way, our work both contributes to the foundations of algorithm configuration and pushes the boundaries of learning theory, since the algorithm classes we analyze consist of multi-stage optimization procedures and are significantly more complex than classes typically studied in learning theory.
A large body of work in machine learning has focused on the problem of learning a close approximation to an underlying combinatorial function, given a small set of labeled examples. However, for real-valued functions, cardinal labels might not be accessible, or it may be difficult for an expert to consistently assign real-valued labels over the entire set of examples. For instance, it is notoriously hard for consumers to reliably assign values to bundles of merchandise. Instead, it might be much easier for a consumer to report which of two bundles she likes better. With this motivation in mind, we consider an alternative learning model, wherein the algorithm must learn the underlying function up to pairwise comparisons, from pairwise comparisons. In this model, we present a series of novel algorithms that learn over a wide variety of combinatorial function classes. These range from graph functions to broad classes of valuation functions that are fundamentally important in microeconomic theory, the analysis of social networks, and machine learning, such as coverage, submodular, XOS, and subadditive functions, as well as functions with sparse Fourier support.
In distributed machine learning, data is dispatched to multiple machines for processing. Motivated by the fact that similar data points often belong to the same or similar classes, and more generally, classification rules of high accuracy tend to be "locally simple but globally complex" (Vapnik & Bottou 1993), we propose data dependent dispatching that takes advantage of such structure. We present an in-depth analysis of this model, providing new algorithms with provable worst-case guarantees, analysis proving existing scalable heuristics perform well in natural non worst-case conditions, and techniques for extending a dispatching rule from a small sample to the entire distribution. We overcome novel technical challenges to satisfy important conditions for accurate distributed learning, including fault tolerance and balancedness. We empirically compare our approach with baselines based on random partitioning, balanced partition trees, and locality sensitive hashing, showing that we achieve significantly higher accuracy on both synthetic and real world image and advertising datasets. We also demonstrate that our technique strongly scales with the available computing power.
The $k$-center problem is a canonical and long-studied facility location and clustering problem with many applications in both its symmetric and asymmetric forms. Both versions of the problem have tight approximation factors on worst case instances: a $2$-approximation for symmetric $k$-center and an $O(\log^*(k))$-approximation for the asymmetric version. In this work, we go beyond the worst case and provide strong positive results both for the asymmetric and symmetric $k$-center problems under a very natural input stability (promise) condition called $\alpha$-perturbation resilience (Bilu & Linial 2012) , which states that the optimal solution does not change under any $\alpha$-factor perturbation to the input distances. We show that by assuming 2-perturbation resilience, the exact solution for the asymmetric $k$-center problem can be found in polynomial time. To our knowledge, this is the first problem that is hard to approximate to any constant factor in the worst case, yet can be optimally solved in polynomial time under perturbation resilience for a constant value of $\alpha$. Furthermore, we prove our result is tight by showing symmetric $k$-center under $(2-\epsilon)$-perturbation resilience is hard unless $NP=RP$. This is the first tight result for any problem under perturbation resilience, i.e., this is the first time the exact value of $\alpha$ for which the problem switches from being NP-hard to efficiently computable has been found. Our results illustrate a surprising relationship between symmetric and asymmetric $k$-center instances under perturbation resilience. Unlike approximation ratio, for which symmetric $k$-center is easily solved to a factor of $2$ but asymmetric $k$-center cannot be approximated to any constant factor, both symmetric and asymmetric $k$-center can be solved optimally under resilience to 2-perturbations.
Jan 20 2015 cs.DS
In the last decade, there has been a substantial amount of research in finding routing algorithms designed specifically to run on real-world graphs. In 2010, Abraham et al. showed upper bounds on the query time in terms of a graph's highway dimension and diameter for the current fastest routing algorithms, including contraction hierarchies, transit node routing, and hub labeling. In this paper, we show corresponding lower bounds for the same three algorithms. We also show how to improve a result by Milosavljevic which lower bounds the number of shortcuts added in the preprocessing stage for contraction hierarchies. We relax the assumption of an optimal contraction order (which is NP-hard to compute), allowing the result to be applicable to real-world instances. Finally, we give a proof that optimal preprocessing for hub labeling is NP-hard. Hardness of optimal preprocessing is known for most routing algorithms, and was suspected to be true for hub labeling.
In conventional supervised pattern recognition tasks, model selection is typically accomplished by minimizing the classification error rate on a set of so-called development data, subject to ground-truth labeling by human experts or some other means. In the context of speech processing systems and other large-scale practical applications, however, such labeled development data are typically costly and difficult to obtain. This article proposes an alternative semi-supervised framework for likelihood-based model selection that leverages unlabeled data by using trained classifiers representing each model to automatically generate putative labels. The errors that result from this automatic labeling are shown to be amenable to results from robust statistics, which in turn provide for minimax-optimal censored likelihood ratio tests that recover the nonparametric sign test as a limiting case. This approach is then validated experimentally using a state-of-the-art automatic speech recognition system to select between candidate word pronunciations using unlabeled speech data that only potentially contain instances of the words under test. Results provide supporting evidence for the utility of this approach, and suggest that it may also find use in other applications of machine learning.