CMU-CS-24-113
Computer Science Department
School of Computer Science, Carnegie Mellon University




Towards an OS for GPUs:
Threadblock Scheduling for DL Workloads

Brian E. Zhang

M.S. Thesis

May 2024

CMU-CS-24-113.pdf


Keywords: GPUs, Operating Systems, CUDA, Scheduling

As the year-over-year performance gains of CPUs have stagnated with the death of Moore's Law, GPUs and other data-parallel chips have seen a surge in demand, particularly for use in datacenter deep learning workloads. In spite of this growing demand, many companies are unable to fully utilize the hardware already in their datacenters. In fact, Alibaba reported a median GPU utilization of less than 10% in 2020. This number implies vast over-provisioning and shows the benefits to be gained from GPU multi-tenancy.

Just as multi-tenancy on traditional CPU architectures is facilitated by an OS, we believe that an OS can similarly solve this problem for GPUs. In this thesis, we describe the design and implementation of the compute scheduler of AxOS, an OS for data-parallel accelerators. AxOS provides transparency, high GPU utilization, performance isolation, and spatial stacking between multiple processes using the GPU. To achieve this, AxOS takes a novel threadblock-centric approach to GPU compute scheduling via virtual streams and kernel chunking. We evaluate AxOS on a ResNet50 training and inference collocation scenario to demonstrate these benefits. We find that AxOS outperforms existing hardware-layer sharing solutions.

53 pages

Thesis Committee:
Dimitrios Skarlatos (Chair)
Todd C. Mowry

Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

