CMU-CS-07-143Computer Science Department School of Computer Science, Carnegie Mellon University
CMU-CS-07-143
Maria-Florina Balcan, Avrim Blum, Santosh Vempala* July 2007
Keywords: Clustering framework, similarity functions, list
clustering, tree clustering, clustering complexity, linkage based
algorithms
Problems of clustering data from pairwise similarity information
are ubiquitous in Computer Science. Theoretical treatments
typically view the similarity information as ground-truth and then
design algorithms to (approximately) optimize various graph-based
objective functions. However, in most applications, this similarity
information is merely based on some heuristic:
the true goal is to cluster the points correctly rather than to
optimize any specific graph property. In this work, we initiate a
theoretical study of the design of similarity functions for clustering
from this perspective.
In particular, motivated by recent work in learning
theory that asks "what natural properties of a similarity function
are sufficient to be able to learn well?" we ask "what natural
properties of a similarity function are sufficient to be able to
We develop a notion of the We also show how our algorithms can be extended to the inductive case, i.e., by using just a constant-sized sample, as in property testing. The analysis here uses regularity-type result of [18] and [4]. 25 pages *College of Computing, Georgia Institute of Technology
| |

Return to:
SCS Technical Report Collection This page maintained by reports@cs.cmu.edu |