CMU-CS-19-101 Computer Science Department School of Computer Science, Carnegie Mellon University
Mining Large Dynamic Graphs and Tensors Kijung Shin Ph.D. Thesis February 2019
Graphs are ubiquitous, representing a variety of information, ranging from who follows whom on online social networks to who reviews what on e-commerce sites. Many of these graphs are large (e.g., online social networks with over two billion active users) and dynamic (i.e., nodes and edges can be added and removed over time). Moreover, they are with rich side information (e.g., e-commerce reviews with timestamps, ratings, and text) and thus naturally modeled as tensors (i.e., multi-dimensional arrays). Given large dynamic graphs and tensors, how can we analyze their structure? How can we detect interesting anomalies? Lastly, how can we model behaviors of individuals in the data? My thesis focuses on these closely related questions, all of which are fundamental to understand massive evolving data on user behavior. That is, we develop scalable algorithms for mining large dynamic graphs and tensors, with a focus on three tasks: 1. Structure Analysis: We build one-pass, sublinear-space algorithms that incrementally estimate the triangle count, which is an important connectivity measure, in large dynamic graphs. In particular, our distributed algorithm yields up to 39 × more accurate estimates faster than a baseline. We also develop distributed and out-of-core algorithms for succinctly but accurately summarizing the structure of large graphs and tensors. They summarize over 25 × larger data without quality loss than their best competitors. 2. Anomaly Detection: We develop near-linear time approximation algorithms for detecting unusually dense subgraphs and subtensors, which signal notable anomalies such as "edit wars" on Wikipedia and fake followers on Twitter. Especially, our tensor algorithm is up to 114 × faster without accuracy loss than the previously best heuristic. We also extend it for distributed or dynamic data with the same approximation guarantee. 3. Behavior Modeling: We design game-theoretic models for purchases of individuals in social networks and a fast algorithm for finding Nash equilibria of the models. In addition, we develop a stage model for the progression of individuals and a distributed optimization algorithm for fitting the model to behavior logs with trillions of records. Using our tools, we measure social inefficiency regarding purchase of sharable goods and discover progression patterns of LinkeIn users. To achieve the highest performance and scalability, our algorithms for the above tasks employ mathematical techniques (e.g., approximation and sampling), use distributed computing frameworks (e.g., MAPREDUCE and MPI), and/or exploit pervasive patterns in realworld data (e.g., power-law degree distribution). We successfully apply them to massive datasets, including 20:6 billion social connections on LinkedIn, 1:47 billion follow relations on Twitter, 783 million hyperlinks between web pages, 483 million edits on Wikipedia, and a synthetic tensor with 1 trillion non-zero entries 302 pages
Thesis Committee:
Srinivasan Seshan, Head, Computer Science Department
| |
Return to:
SCS Technical Report Collection This page maintained by reports@cs.cmu.edu |