CMU-CS-19-129
Computer Science Department
School of Computer Science, Carnegie Mellon University




Replicated Training in Self-Driving Database Management Systems

Gustavo E. Angulo Mezerhane

M.S. Thesis

December 2019

CMU-CS-19-129.pdf


Keywords: Database Systems, Replication, Machine Learning

Self-driving database management systems (DBMSs) are a new family of DBMSs that can optimize themselves for better performance without human intervention. Self-driving DBMSs use machine learning (ML) models that predict system behaviors and make planning decisions based on the workload the system sees. These ML models are trained using metrics produced by different components running inside the system. Self-driving DBMSs are a challenging environment for these models, as they require a significant amount of training data that must be representative of the specific database the model is running on. To obtain such data, self-driving DBMSs must generate this training data themselves in an online setting. This data generation, however, imposes a performance overhead during query execution.

To deal with this performance overhead, we propose a novel technique named Replicated Training that leverages the existing distributed master-replica architecture of a self-driving DBMS to generate training data for models. As opposed to generating training data solely in the master node, Replicated Training load-balances this resource-intensive task across the distributed replica nodes. Under Replicated Training, each replica dynamically throttles training data collection when it needs more resources to keep up with the master node. To show the effectiveness of our technique, we implement it in NoisePage, a self-driving DBMS, and evaluate it in a distributed environment. Our experiments show that training data collection in a DBMS incurs a noticeable 11% performance overhead in the master node, and that Replicated Training eliminates this overhead in the master node while still ensuring that replicas keep up with the master with low delay. Finally, we show that Replicated Training produces ML models with accuracies comparable to those trained solely on the master node.
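The replica-side behavior described above can be sketched as a simple feedback toggle: a replica disables training data collection whenever its replication lag behind the master exceeds some tolerance, and resumes once it has caught up. The following is a minimal illustrative sketch, not NoisePage's actual implementation; the class, method, and threshold names are all hypothetical.

```python
# Hypothetical sketch of a replica's adaptive collection toggle under
# Replicated Training. The replica sheds the metrics-collection overhead
# when its replication lag exceeds a threshold, freeing resources to
# catch up with the master. MAX_LAG_MS is an assumed tolerance, not a
# value from the thesis.

MAX_LAG_MS = 50.0  # assumed maximum acceptable replica delay (ms)

class Replica:
    def __init__(self) -> None:
        self.collecting = True  # training data collection starts enabled

    def on_heartbeat(self, lag_ms: float) -> bool:
        """Enable or disable collection based on current replication lag.

        Returns the collection state after the decision.
        """
        if lag_ms > MAX_LAG_MS:
            self.collecting = False  # lagging: stop collecting, catch up
        else:
            self.collecting = True   # caught up: resume generating data
        return self.collecting

replica = Replica()
assert replica.on_heartbeat(120.0) is False  # far behind: collection off
assert replica.on_heartbeat(10.0) is True    # caught up: collection on
```

A real system would likely add hysteresis (separate on/off thresholds) so collection does not oscillate when the lag hovers near the limit.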

61 pages

Thesis Committee:
Andrew Pavlo (Chair)
David G. Andersen

Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

