CMU-CS-05-170
Computer Science Department
School of Computer Science, Carnegie Mellon University



CMU-CS-05-170

Parameter-Free Spatial and Stream Mining

Spiros Papadimitriou

September 2005

Ph.D. Thesis

CMU-CS-05-170.pdf


Keywords: Data mining, outliers, time series, streams, clustering


Data mining is the extraction of knowledge from large amounts of data and brings together the fields of databases, machine learning and statistics. Technological advances have enabled the creation of huge data repositories. The need to turn such data into useful knowledge has fueled the development of data mining techniques. From a database perspective, the emphasis is often placed on scalability and efficiency. Practical approaches can afford only a single or, at best, a few passes over the data, i.e., the algorithmic complexity must be linear with respect to dataset size.

Recently, in addition to the data warehouse model where data from multiple sources are integrated into a large store, the streaming model is emerging as an alternative data processing paradigm. Several applications produce a continuous stream of data (e.g., phone call detail records, web clickstreams or sensor measurements) that is too large to store in its entirety. Therefore, in stream mining we are allowed only a single pass over the data, without random access. Upon arrival of new observations, we have to incrementally update the data model. Furthermore, space complexity must be sublinear with respect to dataset size.

In this thesis we develop spatial and stream mining tools for discovery of interesting patterns. These patterns summarize the data, enable forecasting of future trends and spotting of anomalies or outliers. Beyond the emphasis on efficiency and scalability, we focus on simplifying or eliminating user intervention. Data mining algorithms must make the discovery task easy for average users. Unfortunately, many of the existing techniques require non-trivial user intervention at several steps of the process. Eliminating the requirement for user intervention should be a top priority in designing data mining methods.

We show that multi-resolution analysis (i.e., examining the data at multiple resolutions or scales) is a powerful tool towards these goals. In particular, for spatial data we employ the correlation integral. For time series streams we use the wavelet transform and related techniques. Furthermore, we leverage tools from signal processing (again wavelets and, also, subspace tracking algorithms) to extract patterns from streams. Finally, we also employ compression principles coupled with multi-level partitionings to automatically cluster spatial data.

The first two parts of this thesis focus on spatial mining methods. In the first part we examine homogeneous spatial data, where all points belong to one class. In the second part we examine heterogeneous spatial data, where the points may belong to two or more different classe (e.g., species, galaxy types, etc). Finally, in the third part we focus on numerical, time series streams and mining techniques for both single and multiple streams.

244 pages


Return to: SCS Technical Report Collection
School of Computer Science homepage

This page maintained by reports@cs.cmu.edu