CMU-CS-99-113
Computer Science Department
School of Computer Science, Carnegie Mellon University
CMU-CS-99-113
Density Biased Sampling: An Improved Method for
Data Mining and Clustering
Christopher R. Palmer, Christos Faloutsos
May 1999
CMU-CS-99-113.ps
CMU-CS-99-113.pdf
Keywords: Random sampling spatial clustering, data mining
Data mining in large data sets often requires a sampling or
summarization step to form an in-core representation of the data that
can be processed more efficiently. Uniform random sampling is
frequently used in practice and also frequently criticized because it
will miss small clusters. Many natural phenomena are known to follow
Zipf's distribution and the inability of uniform sampling to find small
clusters is of practical concern. Density Biased Sampling is proposed
to probabilistically under-sample dense regions and over-sample light
regions. A weighted sample is used to preserve the densities of the
original data. Density biased sampling naturally includes uniform
sampling as a special case. A memory efficient algorithm is proposed
that approximates density biased sampling using only a single scan of
the data. We empirically evaluate density biased sampling using
synthetic data sets that exhibit varying cluster size distributions.
Our proposed method scales linearly and out performs uniform samples
when clustering realistic data sets.
23 pages
|