COMPUTER SCIENCE TECHNICAL REPORT ABSTRACTS

CMU-CS-18-123
Computer Science Department
School of Computer Science, Carnegie Mellon University

CMU-CS-18-123

New Aspects of Beyond Worst-Case Analysis

Colin White

Ph.D. Thesis

October 2018

Keywords: Beyond Worst-Case Analysis, Clustering, k-means, Perturbation Resilience, Distributed Computation, Algorithm Configuration

Traditionally, the theory of algorithms has focused on the analysis of worst-case instances. While this has led to beautiful results and a thorough understanding of a wide variety of problems, there are many problems for which worst-case analysis is not useful for empirical or real-world instances. A rapidly developing line of research, the so-called beyond worst-case analysis of algorithms (BWCA), considers the design and analysis of problems using more realistic models or using natural structural properties. The goals of BWCA are to design realistic models and new theoretical algorithms which perform well in practice, as well as to explain why preexisting algorithms that do well in practice perform better than worst-case analysis suggests. In other words, the goal of BWCA is to bridge the gap between theory and practice. In this thesis, we continue the line of work of BWCA for clustering and distributed learning by making contributions in several areas. Specifically, we focus on three main problems and models.

Clustering is one problem that has benefited greatly from BWCA. The perturbation resilience assumption proposed by Bilu and Linial (2011) states that the optimal clustering does not change under small perturbations to the input distances. We design efficient algorithms that output optimal or near-optimal clusterings for the canonical k-center objective (in both symmetric and asymmetric forms) under perturbation resilience. Our results are tight up to the level of perturbation resilience assumed. Next, we give robust algorithms with guarantees even if only part of the data satisfies the perturbation resilience assumption.
Clustering has many varied applications in practice, and an algorithm may have drastically different performance between two applications. To reconcile this fact, we consider a data-driven approach, in which a clustering application is represented by a distribution over problem instances, and the goal is to find the best algorithm for the (unknown) distribution. This model was first studied theoretically by Gupta and Roughgarden (2016). We define rich, infinite classes of linkage algorithms with dynamic programming, as well as local search algorithms with initialization, generalizing the k-means algorithm. We bound the pseudo-dimension of these classes, which leads to computational- and sample-efficient meta-algorithms for some of the algorithm classes we study.
As datasets become larger and more complex, distributed learning becomes more prevalent. Motivated by the fact that similar datapoints often belong to the same or similar classes, we propose data-dependent dispatching that takes advantage of such structure. We provide new dispatching algorithms which cast the problem as clustering with important balance and fault-tolerance conditions. Finally, we design general and robust distributed clustering algorithms.

Thesis Committee:
Marina-Florina Balcan (Chair)
Anupam Gupta
David Woodruff
Avrim Blum (Toyota Technological Institute of Chicago)
Yury Makarychev (Toyota Technological Institute of Chicago

Srinivasan Seshan, Head, Computer Science Department
Andrew W. Moore, Dean, School of Computer Science

163 pages

Return to: SCS Technical Report Collection
School of Computer Science

This page maintained by reports@cs.cmu.edu