CMU-CS-06-124
Computer Science Department
School of Computer Science, Carnegie Mellon University



Fast Nonparametric Machine Learning Algorithms
for High-dimensional Massive Data and Applications

Ting Liu

March 2006

Ph.D. Thesis

CMU-CS-06-124.ps
CMU-CS-06-124.pdf

Keywords: Nearest-neighbor, k-nearest-neighbor, approximate nearest-neighbor, metric-trees, KNS2, KNS3, IOC, spill-tree, LSH

Nonparametric methods have become increasingly popular in the statistics and probabilistic AI communities. One well-known nonparametric method is "nearest-neighbor": it uses the observations in the training set T that are closest in input space to a query q to form the prediction for q. When the k closest observations in T are considered, it is called k-nearest-neighbor (or k-NN).
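
To make the definition concrete, the following is a minimal brute-force sketch of a k-NN classifier, assuming Euclidean distance over NumPy arrays; the function and variable names (knn_predict, T, labels) are illustrative and not taken from the thesis.

    import numpy as np
    from collections import Counter

    def knn_predict(T, labels, q, k=5):
        """Predict the class of query q by majority vote among its k
        nearest neighbors in the training set T (an n-by-d array)."""
        # Distance from q to every training observation.
        dists = np.linalg.norm(T - q, axis=1)
        # Indices of the k closest observations.
        nearest = np.argsort(dists)[:k]
        # Majority class among those k neighbors.
        return Counter(labels[i] for i in nearest).most_common(1)[0][0]

The linear scan over all of T in this sketch is exactly the cost that the spatial methods discussed below try to avoid.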

Despite its simplicity, k-NN and its variants have been successful in many machine learning problems, including pattern recognition, text categorization, information retrieval, computational statistics, databases, and data mining. It is also used for estimating sample distributions and Bayes error.

However, k-NN and many related nonparametric methods remain hampered by their computational complexity. Many spatial methods, such as metric-trees, have been proposed to reduce the computational cost, but their effectiveness decreases as the dimensionality of the feature vectors increases. From another direction, researchers have tried to develop ways to find approximate answers. The premise of this research is that in many cases it is not necessary to insist on the exact answer; an approximate answer is sufficient. Indeed, some approximate methods perform well in a number of applications, and some enjoy strong theoretical guarantees. However, when faced with hundreds or thousands of dimensions, many algorithms do not work well in practice.
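
As an illustration of why exact spatial methods degrade, here is a schematic sketch of 1-NN search in a metric-tree (ball-tree style), assuming each node stores a bounding center and radius; the node layout and names are hypothetical rather than the thesis's implementation.

    import numpy as np

    def search(node, q, best):
        """Recursive 1-NN search; best is a (distance, point) pair,
        initially (np.inf, None)."""
        # Triangle-inequality pruning: every point x in the node satisfies
        # dist(q, x) >= dist(q, node.center) - node.radius, so the node
        # cannot improve on the current best if that bound is too large.
        if np.linalg.norm(q - node.center) - node.radius >= best[0]:
            return best
        if node.is_leaf:
            for x in node.points:
                d = np.linalg.norm(q - x)
                if d < best[0]:
                    best = (d, x)
            return best
        # Descend into the closer child first so a good candidate is
        # found early, which lets more of the farther child be pruned.
        for child in sorted(node.children,
                            key=lambda c: np.linalg.norm(q - c.center)):
            best = search(child, q, best)
        return best

In high dimensions, pairwise distances concentrate, so the lower bound dist(q, center) - radius rarely exceeds the best distance found so far; the pruning test seldom fires and the search degenerates toward a linear scan, which is one way to see the degradation noted above.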

I propose four new spatial methods for fast k-NN and its variants: KNS2, KNS3, IOC, and spill-tree. The first three algorithms are designed to speed up k-NN classification, and they share the same insight: finding the majority class among the k nearest neighbors of q does not require explicitly finding those k nearest neighbors. Spill-tree is designed for approximate k-NN search. By relaxing metric-trees into a more flexible data structure, spill-tree adapts to the distribution of the data and scales well even to huge, high-dimensional data sets. Significant efficiency improvements have been observed compared to LSH (locality-sensitive hashing), the state-of-the-art approximate k-NN algorithm. I applied spill-tree to three real-world applications, explained in the thesis: video shot segmentation, drug activity detection, and image clustering.
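
A rough sketch of the overlapping split that distinguishes a spill-tree from a metric-tree follows, assuming points are partitioned by projection onto a splitting direction u with an overlap buffer of half-width tau around the median; u and tau are illustrative parameter names, not the thesis's notation.

    import numpy as np

    def spill_split(points, u, tau):
        """Split an n-by-d array of points into two overlapping children."""
        # Project every point onto the splitting direction.
        proj = points @ u
        mid = np.median(proj)
        # Unlike a metric-tree split, points whose projection lies within
        # tau of the median are sent to BOTH children, so a query falling
        # near the boundary still finds its true near neighbors in
        # whichever single leaf the search descends to.
        left = points[proj <= mid + tau]
        right = points[proj >= mid - tau]
        return left, right

The overlap is what allows a spill-tree to answer approximate queries by descending a single root-to-leaf path without backtracking; the buffer width trades memory and build time for accuracy.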

138 pages

