Computer Science Department
School of Computer Science, Carnegie Mellon University
Automated Physical Design:
One of the most challenging tasks for the database administrator is physically designing the database (by selecting design features such as indexes, materialized views, and partitions) to attain optimal performance for a given workload. These features, however, impose storage and maintenance overhead on the database, thus requiring precise selection to balance the performance and the overhead. As the space of the design features is vast, and their interactions hard to quantify, the DBAs spend enormous amount of resources to identify the optimal set of features.
The difficulty of the problem has lead to several physical design tools to automatically decide the design given the data and a representative workload. The state-of-the-art design tools rely on the query optimizer for comparing between physical design alternatives, and search for the optimal set of features. Although it provides an appropriate cost model for physical design, query optimization is a computationally expensive process. Other than affecting the performance of the design tools, the overhead of optimization also limits the physical design tools from searching the space thoroughly – forcing them to prune away the search space to find solutions within a reasonable time. So far it has been impossible to remove query optimization overhead without sacrificing cost estimation precision. Inaccuracies in query cost estimation are detrimental to the quality of physical design algorithms, as they increase the chances of "missing" good designs and consequently selecting sub-optimal ones. Precision loss and the resulting reduction in solution quality is particularly undesirable and it is the reason the query optimizer is used in the first place.
In this thesis, we claim that for the physical design problem, the costs returned by the optimizer contain an intuitive mathematical model. By utilizing this model, the physical design problem can be converted to a compact convex optimization problem with integer variables and solved efficiently to attain near-optimal solutions using mature off-the-shelf solvers.
This approach eliminates the tradeoff between query cost estimation accuracy and performance. We invoke the optimizer a small number of times, and then reuse the results of the invocation to create an accurate model. We demonstrate the usefulness of the model by finding near-optimal physical design for workloads containing thousands of queries and thousands of candidate design alternatives. In a more complex online workload scenario, we devise several algorithms with guaranteed competitive bounds for the physical design problem. The proposed online algorithms provide significant speedups while imposing reasonable overhead on the system. This thesis, demonstrates that optimizer–the most complex component of the DBMS–can be modeled in a restricted (yet important) domain. The same approach can be extended to other domains to build accurate and efficient models for the optimization problems, and optimal solutions can be searched in a principled manner.