Computer Science Department
School of Computer Science, Carnegie Mellon University
A Scalable Database Approach to
As we ride a wave of technological innovation toward the age of petascale computing, the massive input data models and voluminous output data sets involved in large-scale parallel computer simulations and scientific instruments will soon (if not already) overwhelm our ability to generate, manipulate, and interpret them. Although efficient and effective for processing relatively small data sets, traditional main-memory-based algorithms lack the scalability to deal with massive data sets that require more memory than is available. To address this new data challenge, we must seek a more scalable solution.
Inspired by the success of database management systems in managing huge information stores for the commercial and governmental sectors, we propose a database approach to computing and managing large-scale scientific data sets. Although unstructured scientific data sets do not map naturally to the tabular representation of a database system, we resolve the mismatch by introducing a computational cache on top of a standard database buffer pool and providing a mechanism to translate data between the inherent unstructured representation (stored in the computational cache) and the native database format (stored in database pages). Scientific application codes then operate directly on the computational cache instead of on the native database format.
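The translation mechanism can be illustrated with a minimal sketch. All names below (ComputationalCache, page_store, flush) are hypothetical and chosen for exposition; this is not the Abacus implementation. The idea shown is only the general pattern: records live in flat, page-style tuples (standing in for the native database format), are unpacked on demand into an unstructured in-memory view that application code mutates, and dirty entries are translated back to the flat format on flush.

```python
# Illustrative sketch only -- hypothetical names, not the Abacus code.
# Flat tuples (x, y, neighbor ids...) stand in for the native database
# format; the cache holds an unstructured per-vertex view for app code.

class ComputationalCache:
    def __init__(self, page_store):
        self.page_store = page_store  # vertex id -> flat tuple (x, y, *neighbors)
        self.cache = {}               # vertex id -> unstructured dict view
        self.dirty = set()            # ids whose cached view has been modified

    def get(self, vid):
        """Translate a flat tuple into the unstructured view on first access."""
        if vid not in self.cache:
            x, y, *neighbors = self.page_store[vid]
            self.cache[vid] = {"pos": (x, y), "neighbors": set(neighbors)}
        return self.cache[vid]

    def update(self, vid, neighbors):
        """Application code mutates the cached view; mark the entry dirty."""
        self.get(vid)["neighbors"] = set(neighbors)
        self.dirty.add(vid)

    def flush(self):
        """Translate dirty cached views back into the flat page format."""
        for vid in self.dirty:
            v = self.cache[vid]
            self.page_store[vid] = (*v["pos"], *sorted(v["neighbors"]))
        self.dirty.clear()
```

A real system would additionally bound the cache and evict entries through the underlying buffer pool; this sketch omits eviction to keep the translation step visible.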
We have implemented our methodology in a prototype system called Abacus that targets Delaunay triangulations, an unstructured data representation commonly used by many scientific and engineering applications. Abacus can not only store and index existing Delaunay triangulations but also generate large-scale Delaunay triangulations from scratch. Evaluation results show that:
Although some of the techniques presented in this dissertation are specific to Delaunay triangulations, the proposition of a database approach is more general; it points to a promising new way to leverage and extend existing database techniques to compute and manage the large-scale unstructured scientific data sets of the coming era.
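For readers unfamiliar with the data representation Abacus targets: a Delaunay triangulation connects a point set into triangles whose circumcircles contain no other input point, and the central geometric predicate behind most construction algorithms is the in-circumcircle test. The sketch below (a standard determinant formulation, not code from Abacus) shows that predicate:

```python
def in_circumcircle(a, b, c, d):
    """Return True if point d lies strictly inside the circumcircle of
    triangle (a, b, c), assuming a, b, c are given in counter-clockwise
    order. This is the classic 3x3 determinant test, expanded by hand."""
    # Translate so d is at the origin; the sign of the determinant of
    # [[ax, ay, ax^2+ay^2], [bx, by, ...], [cx, cy, ...]] decides the test.
    ax, ay = a[0] - d[0], a[1] - d[1]
    bx, by = b[0] - d[0], b[1] - d[1]
    cx, cy = c[0] - d[0], c[1] - d[1]
    det = ((ax * ax + ay * ay) * (bx * cy - by * cx)
           - (bx * bx + by * by) * (ax * cy - ay * cx)
           + (cx * cx + cy * cy) * (ax * by - ay * bx))
    return det > 0.0
```

A Delaunay triangulation is exactly one in which no triangle's circumcircle contains a fourth input point, so incremental algorithms repeatedly flip edges of triangle pairs that fail this test. In production code the determinant is evaluated with exact or adaptive-precision arithmetic, since floating-point round-off near the circle boundary can flip the sign.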