CMU-CS-19-112 Computer Science Department School of Computer Science, Carnegie Mellon University
Supporting Hybrid Workloads for In-Memory Database Management Tianyu Li M.S. Thesis May 2019
The proliferation of modern data processing ecosystems has given rise to open-source columnar data formats. The key advantage of these formats is that they allow organizations to load data from database management systems (DBMSs) once instead of having to convert it to a new format for each usage. These formats, however, are read-only. This means that organizations must still use a heavyweight transformation process to load data from their original format into the desired columnar format. We aim to reduce or even eliminate this process by developing an in-memory storage management architecture for transactional DBMSs that is aware of the eventual usage of its data and operates directly on columnar storage blocks. We introduce relaxations to common analytical format requirements so that data can be updated efficiently, and rely on a lightweight in-memory transformation process to convert blocks back to analytical forms when they are cold. We also describe how third-party analytical tools can access the data directly with minimal serialization overhead. To evaluate our work, we implemented our storage engine based on the Apache Arrow format and integrated it into the CMDB DBMS. Our experiments show that our approach achieves performance comparable to that of dedicated OLTP DBMSs while also enabling data exports to external data science and machine learning libraries that are orders of magnitude faster than existing approaches.
74 pages
Thesis Committee:
Srinivasan Seshan, Head, Computer Science Department