CMU-CS-03-159
Computer Science Department
School of Computer Science, Carnegie Mellon University
AutoPart: Automating Schema Design for Large
Scientific Databases Using Data Partitioning
Efstratios Papadomanolakis, Anastassia Ailamaki
July 2003
Keywords: Relational databases, performance, self-tuning,
vertical partitioning
Database applications that use multi-terabyte datasets are becoming
increasingly important for scientific fields such as astronomy and biology.
Scientific databases are particularly suited for the application of
automated physical design techniques, because of their data volume and the
complexity of the scientific workloads. Current automated physical design
tools focus on the selection of indexes and materialized views. In
large-scale scientific databases, however, the data volume and the
continuous insertion of new data allow for only limited indexes and
materialized views. By contrast, data partitioning does not replicate data,
thereby reducing space requirements and minimizing update overhead. In this
paper we propose AutoPart, an algorithm that automatically partitions
database tables to optimize sequential access assuming prior knowledge of a
representative workload. The resulting schema is indexed using a fraction of
the space required for indexing the original schema. To evaluate AutoPart,
we build an automated schema design tool that interfaces to commercial
database systems. We experiment with AutoPart in the context of the Sloan
Digital Sky Survey database, a real-world astronomical database, running on
SQL Server 2000. Our experiments corroborate the benefits of partitioning
for large-scale systems: Partitioning alone improves query execution
performance by a factor of two on average. Combined with indexes, the new
schema also outperforms the indexed original schema by 20% (for queries) and
a factor of five (for updates), while using only half the original index
space.
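
The idea of workload-driven vertical partitioning can be illustrated with a
minimal sketch. It is not the AutoPart algorithm itself, only an assumed
simplification: columns that are accessed by exactly the same set of workload
queries are grouped into one fragment, and the cost of a query is estimated as
the bytes a sequential scan would read from the fragments it touches. The
column names, widths, queries, and function names below are hypothetical, and
the estimate ignores any per-fragment key or row-identifier overhead.

```python
from collections import defaultdict


def atomic_fragments(columns, workload):
    """Group columns that are accessed by exactly the same set of queries.

    `workload` maps a query name to the set of columns it reads.
    """
    groups = defaultdict(list)
    for col in columns:
        users = frozenset(q for q, cols in workload.items() if col in cols)
        groups[users].append(col)
    return [sorted(group) for group in groups.values()]


def bytes_scanned(fragments, widths, query_cols, n_rows):
    """Estimate bytes read when the query sequentially scans every fragment
    that holds at least one of the columns it needs."""
    touched_width = sum(
        sum(widths[c] for c in frag)
        for frag in fragments
        if any(c in query_cols for c in frag)
    )
    return n_rows * touched_width


# Hypothetical table and workload, for illustration only.
columns = ["ra", "dec", "mag_u", "mag_g", "flags"]
widths = {c: 8 for c in columns}  # assumed 8 bytes per column
workload = {
    "q1": {"ra", "dec"},
    "q2": {"ra", "dec", "mag_g"},
    "q3": {"mag_u", "mag_g"},
}

fragments = atomic_fragments(columns, workload)
original = [columns]  # the unpartitioned table is a single fragment

for name, cols in workload.items():
    before = bytes_scanned(original, widths, cols, n_rows=10)
    after = bytes_scanned(fragments, widths, cols, n_rows=10)
    print(name, before, after)
```

In this toy setting every query reads fewer bytes from the partitioned schema
than from the original table, which is the effect on sequential access that the
abstract describes. A practical tool would also have to account for the cost of
joining fragments back together.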
15 pages