CMU-CS-18-123
Computer Science Department, School of Computer Science, Carnegie Mellon University
Colin White
Ph.D. Thesis, October 2018
Keywords: Beyond Worst-Case Analysis, Clustering, k-means, Perturbation Resilience, Distributed Computation, Algorithm Configuration
Traditionally, the theory of algorithms has focused on the analysis of worst-case instances. While this has led to beautiful results and a thorough understanding of a wide variety of problems, there are many problems for which worst-case analysis says little about performance on empirical or real-world instances. A rapidly developing line of research, the so-called *beyond worst-case analysis of algorithms* (BWCA), considers the design and analysis of algorithms under more realistic models of the input. This thesis makes contributions to BWCA along three directions:
- Clustering is one problem that has benefited greatly from BWCA. The *perturbation resilience* assumption proposed by Bilu and Linial (2011) states that the optimal clustering does not change under small perturbations to the input distances. We design efficient algorithms that output optimal or near-optimal clusterings for the canonical *k*-center objective (in both symmetric and asymmetric forms) under perturbation resilience. Our results are tight up to the level of perturbation resilience assumed. Next, we give robust algorithms with guarantees even if only part of the data satisfies the perturbation resilience assumption.
- Clustering has many varied applications in practice, and an algorithm may have drastically different performance between two applications. To reconcile this fact, we consider a data-driven approach, in which a clustering application is represented by a distribution over problem instances, and the goal is to find the best algorithm for the (unknown) distribution. This model was first studied theoretically by Gupta and Roughgarden (2016). We define rich, infinite classes of linkage algorithms with dynamic programming, as well as local search algorithms with initialization, generalizing the *k*-means algorithm. We bound the pseudo-dimension of these classes, which leads to computationally and sample-efficient meta-algorithms for some of the algorithm classes we study.
- As datasets become larger and more complex, distributed learning becomes more prevalent. Motivated by the fact that similar datapoints often belong to the same or similar classes, we propose data-dependent dispatching that takes advantage of such structure. We provide new dispatching algorithms which cast the problem as clustering with important balance and fault-tolerance conditions. Finally, we design general and robust distributed clustering algorithms.
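To make the perturbation resilience assumption in the first item concrete, here is a toy sketch (not from the thesis; all names are illustrative). It brute-forces the optimal *k*-center clustering of a tiny instance and heuristically checks that the optimum survives randomly sampled α-perturbations of the distances. Passing such a sampled check is evidence of, not a proof of, α-perturbation resilience.

```python
import itertools
import random

def kcenter_cost(dist, clusters):
    # k-center objective: each cluster picks its best center; the cost is the
    # largest radius over all clusters.
    cost = 0.0
    for C in clusters:
        best = min(max(dist[c][p] for p in C) for c in C)
        cost = max(cost, best)
    return cost

def optimal_clustering(dist, k):
    # Brute force over all k-labelings (feasible only for tiny instances).
    n = len(dist)
    best, best_cost = None, float("inf")
    for labels in itertools.product(range(k), repeat=n):
        if len(set(labels)) < k:
            continue
        clusters = [[i for i in range(n) if labels[i] == c] for c in range(k)]
        cost = kcenter_cost(dist, clusters)
        if cost < best_cost:
            best, best_cost = frozenset(map(frozenset, clusters)), cost
    return best

def looks_alpha_resilient(dist, k, alpha, trials=200, seed=0):
    """Heuristic check: sample perturbations scaling each distance by a factor
    in [1, alpha] and verify the optimal clustering never changes."""
    rng = random.Random(seed)
    base = optimal_clustering(dist, k)
    n = len(dist)
    for _ in range(trials):
        pert = [[dist[i][j] * (1 + rng.random() * (alpha - 1)) if i != j else 0.0
                 for j in range(n)] for i in range(n)]
        # Symmetrize so the perturbed instance stays symmetric.
        pert = [[(pert[i][j] + pert[j][i]) / 2 for j in range(n)] for i in range(n)]
        if optimal_clustering(pert, k) != base:
            return False
    return True

# Two well-separated pairs: {0,1} and {2,3}.
D = [[0, 1, 10, 10],
     [1, 0, 10, 10],
     [10, 10, 0, 1],
     [10, 10, 1, 0]]
print(looks_alpha_resilient(D, k=2, alpha=2.0))  # prints True
```

On this instance every 2-perturbation keeps within-cluster distances at most 2 and cross-cluster distances at least 10, so the optimal clustering is stable, which is exactly the structure the algorithms in the first item exploit.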
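The data-driven model in the second item can be sketched as empirical risk minimization over a parameterized family of clustering algorithms. The α-interpolation between single linkage (α = 0) and complete linkage (α = 1) below is a simplified stand-in for the richer infinite classes studied in the thesis, and all function names are hypothetical:

```python
def alpha_linkage(points, k, alpha):
    """Agglomerative clustering whose merge score interpolates between
    single linkage (alpha=0) and complete linkage (alpha=1)."""
    def d(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pair = [d(p, q) for p in clusters[i] for q in clusters[j]]
                score = (1 - alpha) * min(pair) + alpha * max(pair)
                if best is None or score < best[0]:
                    best = (score, i, j)
        _, i, j = best
        clusters[i] += clusters[j]   # merge the lowest-scoring pair
        del clusters[j]
    return clusters

def kmeans_cost(clusters):
    # Sum of squared distances to each cluster's mean.
    cost = 0.0
    for C in clusters:
        mean = [sum(x) / len(C) for x in zip(*C)]
        cost += sum(sum((a - m) ** 2 for a, m in zip(p, mean)) for p in C)
    return cost

def best_alpha(instances, k, grid):
    # Empirical risk minimization: pick the parameter with the lowest total
    # cost over instances sampled from the application's distribution.
    return min(grid, key=lambda a: sum(kmeans_cost(alpha_linkage(X, k, a))
                                       for X in instances))
```

The pseudo-dimension bounds mentioned above are what guarantee that a parameter chosen this way from finitely many sampled instances generalizes to the unknown distribution.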
Srinivasan Seshan, Head, Computer Science Department
