Computer Science Department
School of Computer Science, Carnegie Mellon University
Scalable and Practical Probability Density
This progression is typical of many disciplines of science, and even resonates in our daily lives. The easier it is for us to collect new data, store it and manage it, the harder it becomes to keep up with what it all means. For that we need to develop tools capable of mining big data sets.
This new generation of data analysis tools must meet the following requirements. They have to be fast and scale well to big data. Their output has to be straightforward to understand and easy to visualize. They need to only ask for the minimum of user input - ideally they would run completely autonomously once given the data.
I focus on clustering. Its main advantage is its generality. Separating data into groups of similar objects reduces the perception problem significantly. In this context, I propose new algorithms and tools to meet the challenges: an extremely fast spatial clustering algorithm, which can also estimate the number of clusters; a novel and highly comprehensible mixture model; a sub-linear learner for dependency trees; and an active learning framework to minimize the burden on a human expert hunting for rare anomalies. I implemented the algorithms and used them with very large data sets in a wide variety of applications, including astrophysics.