Computer Science Department
School of Computer Science, Carnegie Mellon University
Fast Anomaly Discovery given Duplicates
Jay-Yoon Lee, U Kang, Danai Koutra, Christos Faloutsos
Given a large cloud of multi-dimensional points, and an off-the-shelf outlier detection method, why does it take a week to finish? After careful analysis, we discovered that duplicate points create subtle issues, that the literature has ignored: if dmax is the multiplicity of the most overplotted point, typical algorithms are quadratic on dmax. For graph-related outlier detection, all the satellites of a 'star' node will have identical features, and thus create over-plotting with dmax being the highest degree; due to power law degree distributions, this may be huge, for real graph data. We propose several ways to eliminate the problem; we report wall-clock times and our time savings; and we show that our methods give either exact results, or highly accurate approximate ones.