COMPUTER SCIENCE TECHNICAL REPORT ABSTRACTS

CMU-CS-15-116
Computer Science Department
School of Computer Science, Carnegie Mellon University

CMU-CS-15-116

Interactive Algorithms for Unsupervised Machine Learning

Akshay Krishnamurthy

June 2015

Ph.D. Thesis

Keywords: Statistical Machine Learning, Interactive Learning, Unsupervised Learning, Matrix Completion, Subspace Learning, Hierarchical Clustering, Network Tomography

This thesis explores the power of interactivity in unsupervised machine learning problems. Interactive algorithms employ feedback-driven measurements to reduce data acquisition costs and consequently enable statistical analysis in otherwise intractable settings. Unsupervised learning methods are fundamental tools across a variety of domains, and interactive procedures promise to broaden the scope of statistical analysis. We develop interactive learning algorithms for three unsupervised problems: subspace learning, clustering, and tree metric learning. Our theoretical and empirical analysis shows that interactivity can bring both statistical and computational improvements over non-interactive approaches. An over-arching thread of this thesis is that interactive learning is particularly powerful for non-uniform datasets, where non-uniformity is quantified differently in each setting.

We first study the subspace learning problem, where the goal is to recover or approximate the principal subspace of a collection of partially observed data points. We propose statistically and computationally appealing interactive algorithms for both the matrix completion problem, where the data points lie on a low dimensional subspace, and the matrix approximation problem, where one must approximate the principal components of a collection of points. We measure uniformity with the notion of incoherence, and we show that our feedback-driven algorithms perform well under much milder incoherence assumptions.

We next consider clustering a dataset represented by a partially observed similarity matrix. We propose an interactive procedure for recovering a clustering from a small number of carefully selected similarity measurements. The algorithm exploits non-uniformity of cluster size, using few measurements to recover larger clusters and focusing measurements on the smaller structures. In addition to coming with strong statistical and computational guarantees, this algorithm performs well in practice. We also consider a specific metric learning problem, where we compute a latent tree metric to approximate distances over a point set. This problem is motivated by applications in network tomography, where the goal is to approximate the network structure using only measurements between pairs of end hosts. Our algorithms use an interactively chosen subset of the pairwise distances to learn the latent tree metric while being robust to either additive noise or a small number of arbitrarily corrupted distances. As before, we leverage non-uniformity inherent in the tree metric structure to achieve low sample complexity.

Finally, we study a classical hypothesis testing problem where we focus on show fundamental limits for non-interactive approaches. Our main result is a precise characterization of the performance of non-interactive approaches, which shows that, on particular problems, all non-interactive approaches are statistically weaker than a simple interactive one. These results bolster the theme that interactivity can bring about statistical improvements in unsupervised problems.

147 pages

Thesis Committee:
Aarti Singh (Chair)
Maria-Florina Balcan
Barnabás Poczós
Larry Wasserman
Sanjoy Dasgupta (University of California, San Diego)
John Langford (Microsoft Research)

Frank Pfenning, Head, Computer Science Department
Andrew W. Moore, Dean, School of Computer Science

Return to: SCS Technical Report Collection
School of Computer Science

This page maintained by reports@cs.cmu.edu