Machine Learning Department
School of Computer Science, Carnegie Mellon University


HADI: Fast Diameter Estimation and Mining
in Massive Graphs with Hadoop

U Kang, Charalampos Tsourakakis, Ana Paula Appel*,
Christos Faloutsos, Jure Leskovec**

December 2008


Keywords: Diameter, graph, Hadoop

How can we quickly find the diameter of a petabyte-sized graph? Large graphs are ubiquitous: social networks (Facebook, LinkedIn, etc.), the World Wide Web, biological networks, computer networks, and many more. The size of graphs of interest has been increasing rapidly in recent years, and with it the need for algorithms that can handle tera- and peta-byte graphs. A promising direction for coping with such sizes is the emerging map/reduce architecture and its open-source implementation, 'Hadoop'. Estimating the diameter of a graph, as well as the radius of each node, is a valuable operation that can help us spot outliers and anomalies. We propose HADI (HAdoop based DIameter estimator), a carefully designed algorithm to compute the diameters of petabyte-scale graphs. We run the algorithm to analyze the largest public web graph ever analyzed, with billions of nodes and edges. Additional contributions include the following: (a) we propose several performance optimizations; (b) we achieve excellent scale-up; and (c) we report interesting observations, including outliers and related patterns, on this real graph (116 GB), as well as on several other real, smaller graphs. One of the observations is that the Albert et al. conjecture about the diameter of the web is over-pessimistic.
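The abstract above does not spell out HADI's MapReduce internals; to make the quantities being estimated concrete, the following is a minimal, exact (non-scalable) sketch of the definitions involved. The eccentricity of a node is its maximum shortest-path distance to any other node; the radius of a node is used in this sense, the graph diameter is the maximum eccentricity, and the graph radius is the minimum. The function names and the toy graph are ours, chosen for illustration; a petabyte-scale graph would instead require an approximate, distributed algorithm like HADI.

```python
from collections import deque

def eccentricities(adj):
    """BFS from every node; eccentricity = max shortest-path distance reached.

    adj: dict mapping node -> list of neighbors (undirected toy graph).
    Runs one BFS per node, so this is O(V * (V + E)) -- fine for
    illustration, hopeless at web scale.
    """
    ecc = {}
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        ecc[s] = max(dist.values())
    return ecc

# Toy undirected path graph: 0 - 1 - 2 - 3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
ecc = eccentricities(adj)
diameter = max(ecc.values())  # longest shortest path in the graph: 3
radius = min(ecc.values())    # eccentricity of the most central node: 2
```

On the path graph, the endpoints 0 and 3 have eccentricity 3 (the diameter) while the interior nodes 1 and 2 have eccentricity 2 (the radius), which is exactly the per-node radius information that lets one spot outliers: nodes whose eccentricity is far from the rest.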

26 pages

*ICMC - USP São Carlos, Brazil
**Computer Science Department, Cornell/Stanford University
