CMU-CS-23-119
Computer Science Department
School of Computer Science, Carnegie Mellon University




Beyond Model Efficiency: Data Optimizations
for Machine Learning Systems

Michael Roman Kuchnik

Ph.D. Thesis

May 2023



Keywords: Machine Learning Systems

The field of machine learning, particularly deep learning, has witnessed tremendous recent advances due to improvements in algorithms, compute, and datasets. Systems built to support deep learning have primarily targeted computations used to produce the learned model. This thesis proposes to instead focus on the role of data in both training and validation. For the first part of the thesis, we focus on training data, demonstrating that the data pipeline responsible for training data is a prime target for performance considerations. To aid in addressing performance issues, we introduce a form of data subsampling in the space of data transformations, a reduced-fidelity I/O format, and a system for automatically tuning data pipeline performance knobs. In the second part of the thesis, motivated by the trend toward increasingly large and expressive models, we turn to the validation setting, developing a system for automatically querying and validating a large language model's behavior with standard regular expressions. We conclude with future work in the space of data systems for machine learning.

147 pages

Thesis Committee:
George Amvrosiadis (Co-chair)
Virginia Smith (Co-chair)
Tianqi Chen
Greg Ganger
Paul Barham (Google)

Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

