Computer Science Department
School of Computer Science, Carnegie Mellon University


A Distributed and Scalable Peer-to-Peer
Content Discovery System Supporting Complex Queries

Jun Gao

October 2004

Ph.D. Thesis

Keywords: Content discovery, peer-to-peer, distributed hash table, rendezvous point, load balancing, range query, similarity query

A Content Discovery System (CDS) is a system that allows nodes in the system to discover content published by other nodes. CDSes are an essential component of many distributed applications including wide-area service discovery systems, peer-to-peer (P2P) systems, and sensor networks. However, existing applications have difficulties in achieving both rich searchability and good scalability. For example, applications built directly on top of Distributed Hash Tables (DHTs) are scalable but only allow exact name lookup. Peer-to-peer file sharing systems such as Gnutella and KaZaA, on the other hand, offer searching capability, but their flooding-based searching mechanism does not scale with the number of queries.

In this thesis, we present the design, implementation and evaluation of a distributed CDS that meets the scalability and searchability requirements simultaneously. Nodes in the CDS organize themselves into a structured P2P overlay network in a distributed fashion using a DHT. The CDS supports the search of highly dynamic contents represented using a descriptive attribute-value based naming scheme. We ensure scalability by deploying efficient registration and query algorithms using Rendezvous Points (RPs). To maintain high system throughput under realistic skewed load, such as flash crowds, we designed a novel mechanism that uses Load Balancing Matrices (LBMs) to eliminate hot-spots by balancing both registration and query load in a fully distributed fashion. To support complex queries, we developed two new distributed data structures, namely the Range Search Tree and Distributed KD-Tree. These data structures coupled with a set of lightweight tree-based protocols, allow the CDS to support range and similarity queries efficiently without creating bottlenecks.

To evaluate the CDS, we implemented a comprehensive simulator. Our extensive simulation results confirmed the effectiveness of the CDS design. In addition, we implemented a prototype of the CDS that includes the RP-based registration and query mechanisms with distributed load balancing. Our deployment of the prototype on the Internet (Planet Lab testbed) and its integration with a content-based music information retrieval system verified the system's feasibility and its applicability to a large category of real world applications.

164 pages

Return to: SCS Technical Report Collection
School of Computer Science homepage

This page maintained by