CALD TECHNICAL REPORT ABSTRACTS

	CMU-CALD-04-107 Center for Automated Learning and Discovery School of Computer Science, Carnegie Mellon University CMU-CALD-04-107 Fully Automatic Cross-Associations Deepayan Chakrabarti, Spiros Papadimitriou, Dharmendra S. Modha, Christos Faloutsos August 2004 CMU-CALD-04-107.pdf Keywords:* Information Theory, MDL, Cross-Association Large, sparse binary matrices arise in numerous data mining applications, such as the analysis of market baskets, web graphs, social networks, co-citations, as well as information retrieval, collaborative filtering, sparse matrix reordering, etc. Virtually all popular methods for the analysis of such matrices e.g., k-means clustering, METIS graph partitioning, SVD/PCA and frequent itemset mining require the user to specify various parameters, such as the number of clusters, number of principal components, number of partitions, and "support". Choosing suitable values for such parameters is a challenging problem. Cross-association is a joint decomposition of a binary matrix into disjoint row and column groups such that the rectangular intersections of groups are homogeneous. Starting from first principles, we furnish a clear, information-theoretic criterion to choose a good cross-association as well as its parameters, namely, the number of row and column groups. We provide scalable algorithms to approach the optimal. Our algorithm is parameter-free, and requires no user intervention. In practice it scales linearly with the problem size, and is thus applicable to very large matrices. Finally, we present experiments on multiple synthetic and real-life datasets, where our method gives high-quality, intuitive results. 23 pages *IBM Almaden Research Center, San Jose, CA, USA

SCS Technical Report Collection School of Computer Science homepage This page maintained by reports@cs.cmu.edu