CMU-ML-19-100
Machine Learning Department
School of Computer Science, Carnegie Mellon University



CMU-ML-19-100

Anomaly Detection in Graphs and Time Series:
Algorithms and Applications

Bryan Hooi

March 2019

Ph.D. Thesis

CMU-ML-19-100.pdf


Keywords: Anomaly detection, unsupervised, graph, time series, dynamic graphs, fraud detection, tensor


How can we detect fraudsters in large online review networks, or power grid failures using electrical sensor data? With the increasing availablility of web-scale graphs and high-frequency sensor data, anomaly detection in massive datasets has seen growing focus. Social networks such as Facebook and Twitter contain up to billions of users. Similarly, large scale sensor data includes networks of traffic speed detectors that span freeway systems in major metropolitan areas, networks of voltage sensors spanning the electrical power grid, as well as numerous types of industrial, weather and environmental sensors. These datasets have created an increasing need for scalable algorithms that can automatically analyze this data and flag users or events which are anomalous or of interest.

This thesis focuses on these problems, by developing scalable, principled algorithms that detect unusual behavior or events. We focus on the use of connectivity and temporal information, which allow us to detect anomalies in large graphs such as social networks, as well as for sensor datasets such as traffic data, which contain multiple sensors varying over time, that are also arranged on a graph.

First,we focus on static graphs, in which only connectivity information is present. For example, how can we detect fraudsters in a user-product review graph, or a Twitter follower-followee graph? We rst propose a probabilistic approach that evaluates how surprising a dense subgraph is, while avoiding Erdos-Renyi assumptions of existing methods. We then consider how to detect dense subgraphs in a way that prevents anomalous users from evading detection by manipulating their features. Our approach improves detection accuracy by up to 70% F-measure over comparable baselines, and detects a Twitter subgraph of more than 4000 accounts, a majority of which used follower-buying services.

Next, we consider time series: for example, howcanwe detect anomalous events, such as electrical component failures in power grid time series? We develop algorithms for modelling and detecting anomalies in discrete time-series data, such as ratings, where a set of users rate a set of products; and real-valued power grid data, in which we use physics-based circuit models to accurately model and detect anomalies. Then, for mixed categorical, numeric and ordinal data, we propose an online nonparametric anomaly detection approach, that detects anomalies with 61% higher F-measure than related baselines.

Finally, merging graphs and time series, we consider graphs with sensors. Consider a set of sensors arranged in a graph, each collecting data over time: for example, traffic speed sensors, which are arranged on a road network, or voltage sensors on a power grid network. We develop algorithms for detecting anomalous events or large changes happening on a subset of the graph nodes, such as traffic accidents or power line failures. Additionally, we propose algorithms for near-optimally selecting locations for new sensors to be placed on a power grid graph, improving the detection of electrical component failures by 59% or more F-measure.

222 pages

Thesis Committee:
Christos Faloutsos (Chair)
David Choi
Leman Akoglu
Vipin Kumar (University of Minnesota)

Roni Rosenfeld, Head, Machine Learning Department
Tom M. Mitchell, Interim Dean, School of Computer Science


SCS Technical Report Collection
School of Computer Science