CMU-CS-04-121
Computer Science Department
School of Computer Science, Carnegie Mellon University
CMU-CS-04-121
Finding (Recently) Frequent Items in Distributed Data Streams
Amit Manjhi, Vladislav Shkapenyuk,
Kedar Dhamdhere and Christopher Olston
April 2004
CMU-CS-04-121.ps
CMU-CS-04-121.pdf
Keywords: Streams and stream-based processing, optimization and
performance
We consider the problem of maintaining frequency counts for items
occurring frequently in the union of multiple distributed data
streams. Naive methods of combining approximate frequency counts from
multiple nodes tend to result in excessively large data structures
that are costly to transfer among nodes. To minimize communication
requirements, the degree of precision maintained by each node while
counting item frequencies must be managed carefully. We introduce the
concept of a precision gradient for managing precision when nodes are
arranged in a hierarchical communication structure. We then study the
optimization problem of how to set the precision gradient so as to
minimize communication, and provide optimal solutions that minimize
worst-case communication load over all possible inputs. We then
introduce a variant designed to perform well in practice, with input
data that does not conform to worst-case characteristics. We verify
the effectiveness of our approach empirically using real-world data,
and show that our methods incur substantially less communication than
naive approaches while providing the same error guarantees on
answers.
In addition, we extend techniques for maintaining frequency counts of
high-frequency items in one or more streams by making them
time-sensitive. Time-sensitivity is achieved by associating weights
with items that decay exponentially with time. We analyze the error
bounds and worst-case space bounds for the extended algorithms.
24 pages
|