|
CMU-CS-01-133
Computer Science Department
School of Computer Science, Carnegie Mellon University
CMU-CS-01-133
NetCube: A Scalable Tool for Fast Data Mining and Compression
Dimitris Margaritis, Christos Faloutsos, Sebastian Thrun
May 2001
CMU-CS-01-133.ps
CMU-CS-01-133.pdf
Keywords: DataCube approximation, count query, Bayesian network,
Bayesian network structure learning, machine learning
We propose an novel method of computing and storing
DataCubes. Our idea is to use Bayesian Networks, which can
generate approximate counts for any query combination of
attribute values and "don't cares." A Bayesian network
represents the underlying joint probability distribution of
the data that were used to generate it. By means of such a
network the proposed method, NetCube, exploits correlations
among attributes. Our proposed preprocessing algorithm
scales linearly on the size of the database, and is thus
scalable; it is also parallelizable with a straightforward
parallel implementation. Moreover, we give an algorithm to
estimate counts of arbitrary queries that is fast (constant
on the database size). Experimental results show
that NetCubes have fast generation and use (a few minutes
preprocessing time per 100,000 records and less than a
second query time), achieve excellent compression (at least
1800:1 compression ratios on real data) and have low
reconstruction error (less than 5% on average). Moreover,
our method naturally allows for visualization and data
mining, at no extra cost.
21 pages
|