Computer Science Department
School of Computer Science, Carnegie Mellon University


A Theory of Similarity Functions for Clustering

Maria-Florina Balcan, Avrim Blum, Santosh Vempala*

July 2007


Keywords: Clustering framework, similarity functions, list clustering, tree clustering, clustering complexity, linkage based algorithms

Problems of clustering data from pairwise similarity information are ubiquitous in Computer Science. Theoretical treatments typically view the similarity information as ground-truth and then design algorithms to (approximately) optimize various graph-based objective functions. However, in most applications, this similarity information is merely based on some heuristic: the true goal is to cluster the points correctly rather than to optimize any specific graph property. In this work, we initiate a theoretical study of the design of similarity functions for clustering from this perspective. In particular, motivated by recent work in learning theory that asks "what natural properties of a similarity function are sufficient to be able to learn well?" we ask "what natural properties of a similarity function are sufficient to be able to cluster well?"

We develop a notion of the clustering complexity of a given property (analogous to notions of capacity in learning theory), that characterizes its information-theoretic usefulness for clustering. We then analyze this complexity for several natural game-theoretic and learning-theoretic properties, as well as design efficient algorithms that are able to take advantage of them. We consider two natural clustering objectives: (a) list clustering: analogous to the notion of list-decoding, the algorithm can produce a small list of clusterings (which a user can select from) and (b) hierarchical clustering: the desired clustering is some pruning of this tree (which a user could navigate). Our algorithms for hierarchical clustering combine recent learning-theoretic approaches with linkage-style methods.

We also show how our algorithms can be extended to the inductive case, i.e., by using just a constant-sized sample, as in property testing. The analysis here uses regularity-type result of [18] and [4].

25 pages

*College of Computing, Georgia Institute of Technology

Return to: SCS Technical Report Collection
School of Computer Science

This page maintained by