CMU-CALD-04-107
Center for Automated Learning and Discovery
School of Computer Science, Carnegie Mellon University



CMU-CALD-04-107

Fully Automatic Cross-Associations

Deepayan Chakrabarti, Spiros Papadimitriou,
Dharmendra S. Modha*, Christos Faloutsos

August 2004

CMU-CALD-04-107.pdf


Keywords: Information Theory, MDL, Cross-Association

Large, sparse binary matrices arise in numerous data mining applications, such as the analysis of market baskets, web graphs, social networks, co-citations, as well as information retrieval, collaborative filtering, sparse matrix reordering, etc. Virtually all popular methods for the analysis of such matrices e.g., k-means clustering, METIS graph partitioning, SVD/PCA and frequent itemset mining require the user to specify various parameters, such as the number of clusters, number of principal components, number of partitions, and "support". Choosing suitable values for such parameters is a challenging problem.

Cross-association is a joint decomposition of a binary matrix into disjoint row and column groups such that the rectangular intersections of groups are homogeneous. Starting from first principles, we furnish a clear, information-theoretic criterion to choose a good cross-association as well as its parameters, namely, the number of row and column groups. We provide scalable algorithms to approach the optimal. Our algorithm is parameter-free, and requires no user intervention. In practice it scales linearly with the problem size, and is thus applicable to very large matrices. Finally, we present experiments on multiple synthetic and real-life datasets, where our method gives high-quality, intuitive results.

23 pages

*IBM Almaden Research Center, San Jose, CA, USA


SCS Technical Report Collection
School of Computer Science homepage

This page maintained by reports@cs.cmu.edu