Computer Science Department
School of Computer Science, Carnegie Mellon University


Scalable and Practical Probability Density
Estimators for Scientific Anomaly Detection

Dan Pelleg

May 2004

Ph.D. Thesis

Keywords: Machine learning, scientific discovery, mixture models, probability density estimators, efficient algorithms

Originally, astronomers dealt with stars. Later, with galaxies. Today, large scale cosmological structures are so complex, they must be first reduced into more succinct representations. For example, a universe simulation containing millions of objects is characterized by its halo occupation distribution.

This progression is typical of many disciplines of science, and even resonates in our daily lives. The easier it is for us to collect new data, store it and manage it, the harder it becomes to keep up with what it all means. For that we need to develop tools capable of mining big data sets.

This new generation of data analysis tools must meet the following requirements. They have to be fast and scale well to big data. Their output has to be straightforward to understand and easy to visualize. They need to only ask for the minimum of user input - ideally they would run completely autonomously once given the data.

I focus on clustering. Its main advantage is its generality. Separating data into groups of similar objects reduces the perception problem significantly. In this context, I propose new algorithms and tools to meet the challenges: an extremely fast spatial clustering algorithm, which can also estimate the number of clusters; a novel and highly comprehensible mixture model; a sub-linear learner for dependency trees; and an active learning framework to minimize the burden on a human expert hunting for rare anomalies. I implemented the algorithms and used them with very large data sets in a wide variety of applications, including astrophysics.

166 pages

Return to: SCS Technical Report Collection
School of Computer Science homepage

This page maintained by