CMU-ML-07-106
Machine Learning Department
School of Computer Science, Carnegie Mellon University



CMU-ML-07-106

Using Distributed M-Trees for Answering
K-Nearest Neighbor Queries

Brent Bryan, Andrew W. Moore*, Andrew Snyder*, Jeff Schneider**

April 2007

CMU-ML-07-106.pdf


Keywords: M-trees, parallelization, k-Nearest Neighbor
The proliferation of large dynamic data sets allows for unprecedented learning opportunities. However, collections of data are only a valuable resource if methods exist for quickly finding and extracting relevant information. Hence, it is desirable to store this data in an indexing structure, such as a tree. Traditional tree-based data structures typically store as much of the tree as possible in RAM, as writing nodes to the disk results in a significant performance hit. In order to eliminate the disk-based bottleneck, several techniques have been proposed to store large indexing trees over a series of machines. As the tree is no longer stored on a single machine, a trade off between tree balance and insertion cost (due to machine communication) arises. In this work, we present a general framework for maintaining a tree structure over parallel resources in a dynamic environment. We show that the technique results in a dynamic tree structure with k-nn query rates similar to those of the optimal tree for uniform data sets and significantly better when the data is either skewed or dynamic. In particular, the algorithm is ideally suited for querying server logs.

23 pages

*Google Pittsburgh
**Robotics Institute, Carnegie Mellon University


SCS Technical Report Collection
School of Computer Science homepage

This page maintained by reports@cs.cmu.edu