CMU-CS-25-130
Computer Science Department
School of Computer Science, Carnegie Mellon University
Towards Effortless High-Performance
Jinqi (Kathryn) Chen
M.S. Thesis
August 2025
Recent advances in large language models (LLMs) have pushed GPU hardware to its limits, requiring highly optimized kernels for compute- and bandwidth-intensive operations such as matrix multiplication, attention, and inter-GPU communication. However, achieving state-of-the-art efficiency often demands deep low-level expertise, slowing development and limiting accessibility. This thesis presents TIR+, a multi-level compiler framework that unifies high-level productivity and low-level optimization within a single compilation and runtime infrastructure. TIR+ spans from a Python-based tiling DSL, which enables rapid kernel prototyping, to a hardware-centric intermediate representation (IR), which offers fine-grained control over memory, parallelism, and specialized instructions. Between these extremes, it provides optimized tensor libraries and reusable primitives. Crucially, TIR+ is distributed-aware, supporting multi-GPU execution with built-in communication management and compute–communication overlap. We demonstrate the capabilities of TIR+ on key LLM kernels, including GEMM, attention, and fused compute–communication kernels. Across these cases, TIR+ matches state-of-the-art performance with significantly less development effort than hand-tuned CUDA, demonstrating a unified and scalable path toward hardware-aware kernel optimization for current and future AI workloads.

49 pages
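To make the tiling idea in the abstract concrete, below is a minimal pure-Python/NumPy sketch of a tiled GEMM. It is not the TIR+ DSL (whose actual syntax is not shown in this listing); the function name, the tile parameter, and the overall structure are illustrative assumptions, meant only to show the tile-level decomposition that such a DSL would express and that a compiler would map onto GPU shared memory, thread blocks, and tensor-core instructions.

    import numpy as np

    def tiled_matmul(A: np.ndarray, B: np.ndarray, tile: int = 64) -> np.ndarray:
        """Compute C = A @ B one (tile x tile) output block at a time.

        Each output tile is accumulated over the K dimension in tile-sized
        chunks -- the same decomposition a tiling DSL would hand off to the
        compiler for mapping onto on-chip memory and parallel hardware.
        (Hypothetical illustration; not the actual TIR+ API.)
        """
        M, K = A.shape
        K2, N = B.shape
        assert K == K2, "inner dimensions must match"
        C = np.zeros((M, N), dtype=np.result_type(A, B))
        for i in range(0, M, tile):          # rows of the output tile grid
            for j in range(0, N, tile):      # columns of the output tile grid
                acc = np.zeros((min(tile, M - i), min(tile, N - j)), dtype=C.dtype)
                for k in range(0, K, tile):  # reduction over K, tile by tile
                    acc += A[i:i + tile, k:k + tile] @ B[k:k + tile, j:j + tile]
                C[i:i + tile, j:j + tile] = acc
        return C

    # Quick check against the reference implementation.
    A = np.random.rand(300, 200).astype(np.float32)
    B = np.random.rand(200, 150).astype(np.float32)
    assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)

In a real tiling DSL the three loops would become a parallel grid over output tiles plus a per-tile reduction loop, with the compiler choosing tile sizes, memory placement, and instruction selection.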
Thesis Committee:
Srinivasan Seshan, Head, Computer Science Department