CMU-CS-20-108
Computer Science Department
School of Computer Science, Carnegie Mellon University



CMU-CS-20-108

Mining Anomalies using Static and Dynamic Graphs

Dhivya Eswaran

Ph.D. Thesis

May 2020

CMU-CS-20-108.pdf


Keywords: Anomaly detection, graphs, time-series, unsupervised algorithms, semi-supervised algorithms, streaming algorithms

Detection of anomalies, i.e. rare or unusual patterns, is a pressing problem in a number of contexts such as security, healthcare, finance and the web. Anomalies such as review fraud and network intrusion attacks encode suspicious, fraudulent or malicious behavior and do not just influence people into making sub-optimal decisions but also steadily erode their trust in businesses. As such, algorithms to detect on going anomalies and warn against upcoming anomalies have high impact for businesses and end-users alike.

This thesis considers the problem of anomaly detection by developing principled, scalable algorithms that detect unusual behavior or events by leveraging connectivity and temporal information. These approaches are useful for large dynamic complex datasets having strong relational and temporal characteristics, with multiple entities interacting with each other and also evolving overtime. Such datasets are generated in a multitude of diverse contexts today, with examples ranging from e-commerce logs to online social networks to the internet-of-things.

The first half of the thesis focuses on anomaly detection in graphs where only static connectivity information is known. Given a graph, and a few labeled vertices, how can we infer the labels for the remaining vertices? For example, how can we spot all fake user accounts on Amazon or Facebook from a small set of manually labeled honest and fake accounts? Compared to existing literature, our work leverages three key properties of real-world graphs, namely, heterogeneity in vertex and edge types, skewed degree distributions, and higher-order structures, to yield more accurate vertex labeling. The proposed algorithms have closed-form solutions, rigorous convergence guarantees, can be efficiently implemented using sparse matrix operations, and scale linearly with graph size.

The second half of the thesis focuses on mining anomalies from data where the connectivity structure evolves over time. In many settings, especially those relating to security and health care, the value of a new found or anticipated anomaly lies in the moment, and not later. Thus, given a time-evolving graph (explicit or implicit), how can we detect anomalies or events in near real-time, or perhaps even early warn before their occurrence? Our algorithms can detect anomalous graph footprints such as sudden appearance or disappearance of dense subgraphs and bridge edges in near real-time, by only storing a small synopsis of the graph seen so far and requiring no supervision. We also show how to infer state-transition graph from time series data in an online manner and use that to early warn against user-labeled anomalies such as adverse medical conditions.

Throughout the thesis, a strong emphas is is placed on algorithms which are not only (a) effective in practice, but are also (b) efficient, processing millions of edges in under a few seconds on a stock laptop, and are (c) principled, can be reasoned about rigorously, yielding theoretical guarantees for inference, detection, or leveraging data-related insights. We demonstrate the efficacy of our algorithms in a range of applications from social networks and e-commerce to security and health care.

182 pages

Thesis Committee:
Christos Faloutsos (Chair)
Aarti Singh
Zico Kolter
Nina Mishra (Amazon)

Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science


Return to: SCS Technical Report Collection
School of Computer Science

This page maintained by reports@cs.cmu.edu