Machine Learning Department
School of Computer Science, Carnegie Mellon University


Detecting Anomalous Groups in Categorical Datasets

Kaustav Das, Jeff Schneider, Daniel B. Neill

April 2009


Keywords: Pattern detection, anomaly detection, machine learning

We propose a new method for detecting groups of anomalies in categorical datasets. Our approach generalizes the spatial scan statistic, a commonly used method for detecting clusters of increased counts in spatial data. We extend this framework to non-spatial datasets with discrete-valued attributes, where the degree of anomalousness of each record depends on its attribute values, and we wish to find self-similar groups of anomalous records. We model the relationship between the attributes using a probabilistic model (e.g., a Bayesian network), define a likelihood ratio statistic in terms of the pseudo-likelihoods for the null and alternative hypotheses, and maximize this statistic over all subsets of records. Since an exhaustive search over all such groups is computationally infeasible, we propose an efficient (but approximate) search heuristic. We show that this algorithm accurately detects anomalous groups in real-world hospital, container shipping, and network connection data.
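
The general shape of the approach described above (a null model, a likelihood ratio score for a group, and an approximate search over groups) can be illustrated with a small sketch. This is not the report's algorithm. It assumes, purely for illustration, a null model with independent attributes (a Bayesian network with no edges) rather than a learned network, ordinary smoothed likelihoods rather than the report's pseudo-likelihoods, and a plain greedy search rather than the report's heuristic. The function names (fit, log_prob, group_score, greedy_search) and the toy data are hypothetical.

```python
import math
from collections import Counter


def fit(records, n_values, alpha=1.0):
    """Fit smoothed per-attribute marginals.

    Assumption: attributes are independent (a Bayesian network with no edges).
    """
    n = len(records)
    counts = [Counter(r[j] for r in records) for j in range(len(n_values))]
    return [
        {"counts": c, "denom": n + alpha * k, "alpha": alpha}
        for c, k in zip(counts, n_values)
    ]


def log_prob(record, model):
    """Log-probability of one record under a fitted model."""
    return sum(
        math.log((m["counts"][v] + m["alpha"]) / m["denom"])
        for m, v in zip(model, record)
    )


def group_score(group, null_model, n_values):
    """Log-likelihood ratio: a model fit to the group itself vs. the null model."""
    alt_model = fit(group, n_values)
    return sum(log_prob(r, alt_model) - log_prob(r, null_model) for r in group)


def greedy_search(test, null_model, n_values):
    """Approximate search: grow a group from each seed while the score improves."""
    best_members, best_score = [], float("-inf")
    for seed in range(len(test)):
        members = [seed]
        score = group_score([test[seed]], null_model, n_values)
        while True:
            candidates = [
                (group_score([test[k] for k in members + [i]], null_model, n_values), i)
                for i in range(len(test))
                if i not in members
            ]
            if not candidates:
                break
            cand_score, cand = max(candidates)
            if cand_score <= score:
                break  # no single addition improves the group
            members.append(cand)
            score = cand_score
        if score > best_score:
            best_members, best_score = sorted(members), score
    return best_members, best_score


# Toy data (hypothetical): two common record types in training,
# plus three identical records with unseen values in the test set.
train = [("a", "x", "p")] * 50 + [("b", "y", "q")] * 50
test = [
    ("a", "x", "p"), ("b", "y", "q"),
    ("c", "z", "r"), ("c", "z", "r"), ("c", "z", "r"),
    ("a", "x", "p"),
]
n_values = [len({r[j] for r in train + test}) for j in range(3)]
null_model = fit(train, n_values)
group, score = greedy_search(test, null_model, n_values)
print(group, round(score, 2))
```

On this toy data the search returns the indices of the three identical rare records: each is unlikely under the null model, and together they are well explained by a model fit to the group, which is the "self-similar group of anomalous records" pattern the abstract targets. The greedy loop rescoring every candidate is quadratic per seed and is meant only to show the idea at small scale.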

21 pages
