Computer Science Department
School of Computer Science, Carnegie Mellon University
Parameter-Free Spatial and Stream Mining
Recently, in addition to the data warehouse model where data from multiple sources are integrated into a large store, the streaming model is emerging as an alternative data processing paradigm. Several applications produce a continuous stream of data (e.g., phone call detail records, web clickstreams or sensor measurements) that is too large to store in its entirety. Therefore, in stream mining we are allowed only a single pass over the data, without random access. Upon arrival of new observations, we have to incrementally update the data model. Furthermore, space complexity must be sublinear with respect to dataset size.
In this thesis we develop spatial and stream mining tools for discovery of interesting patterns. These patterns summarize the data, enable forecasting of future trends and spotting of anomalies or outliers. Beyond the emphasis on efficiency and scalability, we focus on simplifying or eliminating user intervention. Data mining algorithms must make the discovery task easy for average users. Unfortunately, many of the existing techniques require non-trivial user intervention at several steps of the process. Eliminating the requirement for user intervention should be a top priority in designing data mining methods.
We show that multi-resolution analysis (i.e., examining the data at multiple resolutions or scales) is a powerful tool towards these goals. In particular, for spatial data we employ the correlation integral. For time series streams we use the wavelet transform and related techniques. Furthermore, we leverage tools from signal processing (again wavelets and, also, subspace tracking algorithms) to extract patterns from streams. Finally, we also employ compression principles coupled with multi-level partitionings to automatically cluster spatial data.
The first two parts of this thesis focus on spatial mining methods. In the first part we examine homogeneous spatial data, where all points belong to one class. In the second part we examine heterogeneous spatial data, where the points may belong to two or more different classe (e.g., species, galaxy types, etc). Finally, in the third part we focus on numerical, time series streams and mining techniques for both single and multiple streams.