CMU-ML-09-112
Machine Learning Department
School of Computer Science, Carnegie Mellon University



CMU-ML-09-112

Fast Algorithms for Querying and Mining Large Graphs

Hanghang Tong

September 2009

Ph.D. Thesis

CMU-ML-09-112.pdf


Keywords: Network analysis, querying, mining, complex user specific pattern, trend analysis, scalability, immunization, anomaly detection, proximity, spectral analysis, low-rank-approximation

Graphs appear in a wide range of settings and have posed a wealth of fascinating problems. In this thesis, we focus on two types of tasks according to the interaction with users: (1) querying (e.g., given a social network, how to measure the closeness between two persons? how to track it over time?) and (2) mining (e.g., how to identify abnormal behaviors of computer networks? In the case of virus attacks, which nodes are the best to immunize?).

The task of querying includes three sub-tasks. In the first one, we found that many complex user-specific patterns on large graphs can be answered by means of proximity measurement. In other words, proximity allows us to query large graphs on the atomic level. We support our claim by conducting three case studies (connection subgraphs, user feedback, and gateway), all of which (despite their diversity) rely on the proximity measurement as their building block. The proposed algorithms are operational, with careful design and numerous optimizations. For the second sub-task, in order to adapt the querying task to time-evolving graphs, we proposed an efficient algorithm to track proximity on time-evolving graphs, which enables us to do trend analysis on the graph level. The proposed algorithm is up to 176x faster than competitors and has no quality loss. Finally, in order to handle the scalability issue in the task of querying, we developed a family of fast solutions to compute the proximity in several different scenarios. By carefully leveraging some important properties shared by many real graphs (e.g., the block-wise structure, the linear correlation, the skewness of real bipartite graphs, etc), we can often achieve orders of magnitude of speedup with little or no quality loss.

The task of mining also includes three sub-tasks. In the first one, we proposed an algorithm (NetShield) for immunization under the SIS model. While straight-forward methods are computationally intractable (O((nk) m)), the proposed algorithm is nearoptimal, fast (up to 7 orders of magnitude speedup), and scalable (O(nk2 + m)). In the second sub-task, we proposed a family of example-based low-rank matrix approximation methods for anomaly detection. The proposed algorithms are provably equal to or better than the best known methods in both space and time, with the same accuracy. On real data sets, it is up to 112x faster than the best competitors, for the same accuracy. Finally, we showed that graphs also provide a powerful tool to solve some complex problems. As a case study, we proposed a general framework to mine complex time stamped events (e.g., to find similar time stamps, to find abnormal time stamps and to provide interpretations for our findings, etc) by envisioning the problem as a graph analysis problem. We further proposed MT3 to handle multiple-scale analysis, achieving up to 2 orders of magnitude speedup, with the same quality.

210 pages


SCS Technical Report Collection
School of Computer Science homepage

This page maintained by reports@cs.cmu.edu