MACHINE LEARNING TECHNICAL REPORT ABSTRACTS

CMU-ML-07-106
Machine Learning Department
School of Computer Science, Carnegie Mellon University

Using Distributed M-Trees for Answering K-Nearest Neighbor Queries

Brent Bryan, Andrew W. Moore, Andrew Snyder, Jeff Schneider

April 2007

CMU-ML-07-106.pdf

Keywords: M-trees, parallelization, k-Nearest Neighbor

The proliferation of large dynamic data sets allows for unprecedented learning opportunities. However, collections of data are only a valuable resource if methods exist for quickly finding and extracting relevant information. Hence, it is desirable to store this data in an indexing structure, such as a tree. Traditional tree-based data structures typically store as much of the tree as possible in RAM, as writing nodes to the disk results in a significant performance hit. In order to eliminate the disk-based bottleneck, several techniques have been proposed to store large indexing trees over a series of machines. As the tree is no longer stored on a single machine, a trade-off between tree balance and insertion cost (due to machine communication) arises. In this work, we present a general framework for maintaining a tree structure over parallel resources in a dynamic environment. We show that the technique results in a dynamic tree structure with k-nn query rates similar to those of the optimal tree for uniform data sets and significantly better when the data is either skewed or dynamic. In particular, the algorithm is ideally suited for querying server logs.

23 pages

Google Pittsburgh
*Robotics Institute, Carnegie Mellon University
