COMPUTER SCIENCE TECHNICAL REPORT ABSTRACTS

CMU-CS-23-110
Computer Science Department
School of Computer Science, Carnegie Mellon University

CMU-CS-23-110

Practical Coding-Theoretic Tools for Machine Learning
Systems and by Machine Learning Systems

Jack Kosaian

Ph.D. Thesis

April 2023

CMU-CS-23-110.pdf

Keywords: Fault tolerance, machine learning, coding theory, machine learning systems

Machine learning (ML) is now used in many domains, such as web services and safety-critical systems. This has led to the development of ML systems for deploying and training ML models. Beyond achieving high accuracy, ML systems must also use computing infrastructure efficiently and tolerate unreliable infrastructure.

Coding-theoretic tools enable many systems to operate reliably without the significant resource overhead that accompanies replication-based approaches. These tools are used in production storage and communication systems, and there is growing interest in their use for distributed computing.

This thesis explores the interplay between ML systems and practical applications of coding theory. Specifically, we show how ML systems can be made more reliable and efficient via novel uses of coding-theoretic tools, and how coding-theoretic tools can be expanded in reach and be made more efficient through techniques from ML and ML systems. We illustrate this interaction via multiple thrusts:

(1) We show how properties unique to ML systems can be exploited to efficiently integrate coding-theoretic fault tolerance techniques into ML systems. First, we reduce the execution-time overhead of fault-tolerant inference on GPUs by up to 5.3 x by exploiting trends in neural network design and GPU hardware. Second, we show how coding-theoretic tools can be coupled with the unique properties of recommendation models to enable low-overhead fault tolerance in training.
(2) We demonstrate that co-designing coding-theoretic tools with ML systems offers new opportunities to extend these tools beyond prior limitations. Specifically, we enable resource-efficient fault tolerance in distributed prediction serving systems by using ML to overcome a key barrier in prior coding-theoretic tools.
(3) We identify opportunities for ideas inspired by coding theory to be used to improve the performance of ML systems, even when reliability is not a concern. We show that the throughput and GPU utilization of specialized convolutional neural network inference can be improved by up to 2.5 x by combining images in a coding-theory-inspired manner and making small modifications to the model architecture.
(4) Finally, we show that the encoding and decoding functions of one popular class of coding-theoretic tools, linear codes, can operate at higher throughput and with little developer effort via advancements in ML systems. We show how similarities between operations in linear codes and those in ML libraries enable linear codes to be represented via ML libraries, and thus allow libraries for computing linear codes to adopt the many optimizations that have gone into ML libraries. This approach outperforms custom libraries for computing linear codes by up to 2.2 x.

Through these thrusts, this thesis demonstrates the promise of using coding-theoretic in ML systems and ideas from ML systems in coding-theoretic tools to bring about the next generation of efficient and reliable systems.

193 pages

Thesis Committee:
Rashmi Vinayak (Chair)
Phillip Gibbons
Zico Kolter
Ion Stoica (University of California, Berkeley)
Pramod Viswanath (Princeton University)

Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

Return to: SCS Technical Report Collection
School of Computer Science

This page maintained by reports@cs.cmu.edu