MACHINE LEARNING TECHNICAL REPORT ABSTRACTS

	CMU-ML-09-112 Machine Learning Department School of Computer Science, Carnegie Mellon University CMU-ML-09-112 Fast Algorithms for Querying and Mining Large Graphs Hanghang Tong September 2009 Ph.D. Thesis CMU-ML-09-112.pdf Keywords: Network analysis, querying, mining, complex user specific pattern, trend analysis, scalability, immunization, anomaly detection, proximity, spectral analysis, low-rank-approximation Graphs appear in a wide range of settings and have posed a wealth of fascinating problems. In this thesis, we focus on two types of tasks according to the interaction with users: (1) querying (e.g., given a social network, how to measure the closeness between two persons? how to track it over time?) and (2) mining (e.g., how to identify abnormal behaviors of computer networks? In the case of virus attacks, which nodes are the best to immunize?). The task of querying includes three sub-tasks. In the first one, we found that many complex user-specific patterns on large graphs can be answered by means of proximity measurement. In other words, proximity allows us to query large graphs on the atomic level. We support our claim by conducting three case studies (connection subgraphs, user feedback, and gateway), all of which (despite their diversity) rely on the proximity measurement as their building block. The proposed algorithms are operational, with careful design and numerous optimizations. For the second sub-task, in order to adapt the querying task to time-evolving graphs, we proposed an efficient algorithm to track proximity on time-evolving graphs, which enables us to do trend analysis on the graph level. The proposed algorithm is up to 176x faster than competitors and has no quality loss. Finally, in order to handle the scalability issue in the task of querying, we developed a family of fast solutions to compute the proximity in several different scenarios. By carefully leveraging some important properties shared by many real graphs (e.g., the block-wise structure, the linear correlation, the skewness of real bipartite graphs, etc), we can often achieve orders of magnitude of speedup with little or no quality loss. The task of mining also includes three sub-tasks. In the first one, we proposed an algorithm (NetShield) for immunization under the SIS model. While straight-forward methods are computationally intractable (O((ⁿk) m)), the proposed algorithm is nearoptimal, fast (up to 7 orders of magnitude speedup), and scalable (O(nk² + m)). In the second sub-task, we proposed a family of example-based low-rank matrix approximation methods for anomaly detection. The proposed algorithms are provably equal to or better than the best known methods in both space and time, with the same accuracy. On real data sets, it is up to 112x faster than the best competitors, for the same accuracy. Finally, we showed that graphs also provide a powerful tool to solve some complex problems. As a case study, we proposed a general framework to mine complex time stamped events (e.g., to find similar time stamps, to find abnormal time stamps and to provide interpretations for our findings, etc) by envisioning the problem as a graph analysis problem. We further proposed MT3 to handle multiple-scale analysis, achieving up to 2 orders of magnitude speedup, with the same quality. 210 pages

SCS Technical Report Collection School of Computer Science homepage This page maintained by reports@cs.cmu.edu