Computer Science Department
School of Computer Science, Carnegie Mellon University
Fast Algorithms for Mining
Time series data arise in many applications, from motion capture, environmental monitoring, temperatures in data centers, to physiological signals in health care. In the thesis, I will focus on the theme of learning and mining large collections of co-evolving sequences, with the goal of developing fast algorithms for finding patterns, summarization, and anomalies. In particular, this thesis will answer the following recurring challenges for time series:
1. Forecasting and imputation: How to do forecasting and to recover
missing values in time series data?
We develop models to mine time series with missing values, to extract compact representation from time sequences, to segment the sequences, and to do forecasting. For large scale data, we propose algorithms for learning time series models, in particular, including Linear Dynamical Systems (LDS) and Hidden Markov Models (HMM). We also develop a distributed algorithm for finding patterns in large web-click streams. Our thesis will present special models and algorithms that incorporate domain knowledge. For motion capture, we will describe the natural motion stitching and occlusion filling for human motion. In particular, we provide a metric for evaluating the naturalness of motion stitching, based which we choose the best stitching. Thanks to domain knowledge (body structure and bone lengths), our algorithm is capable of recovering occlusions in mocap sequences, better in accuracy and longer in missing period. We also develop an algorithm for forecasting thermal conditions in a warehouse-sized data center. The forecast will help us control and manage the data center in a energy-efficient way, which can save a significant percentage of electric power consumption in data centers.