CMU-CS-25-153
Computer Science Department
School of Computer Science, Carnegie Mellon University



An Analytical Framework for Operation
Performance in Heterogeneous LLM Serving

Shiqi Pan

M.S. Thesis

December 2025

CMU-CS-25-153.pdf
Currently Unavailable Electronically


Keywords: LLM Inference, Heterogeneous Hardware, Pipeline Serving

Large language models have grown beyond single-GPU capacity, necessitating distributed inference approaches. Pipeline serving has emerged as a standard deployment method, partitioning models into sequential stages executed on separate devices with point-to-point communication at stage boundaries. As both model architectures and datacenter hardware have become increasingly heterogeneous in their compute-memory characteristics, disaggregated serving has evolved to address this diversity by splitting models at finer granularities of sublayers or individual operations and assigning them to different resource types based on their computational and memory-access profiles.

Prior work has demonstrated the effectiveness of disaggregated serving: beyond traditional pipeline parallelism, which divides a model into equal layer groups, recent disaggregation systems assign each partition to the hardware type best matched to its compute-memory bottleneck.

However, current disaggregation strategies rely on fixed heuristics designed for particular model types. These static partitioning rules work well on the model architectures and hardware they target but cannot be systematically extended beyond them. As model architectures and hardware grow increasingly heterogeneous, a formal, generalizable understanding of operations' compute-memory characteristics across hardware types becomes essential.

This thesis develops an analytical framework for operation-level serving performance on heterogeneous GPUs. We derive analytical models for linear projections, full attention, and sliding window attention, analyzing their compute costs, memory costs, and arithmetic intensity, and how these properties vary with batch size, sequence length, and hardware type. We then extend these ideal analytical models with parametric latency models that incorporate efficiency factors to account for hardware underutilization, fitting these parameters through empirical profiling of Gemma3 27B on H100 and H20 GPUs. The fitted parametric models accurately capture real-world performance while revealing insights that analytical models alone cannot predict. Our analysis shows that operations exhibit dramatically different efficiency characteristics, ranging from near-perfect compute utilization (94% of theoretical peak) to substantial memory-bandwidth underutilization (as low as 8% efficiency), and that hardware selection requires understanding operation-specific efficiency, not just peak specifications.
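As an illustration of the kind of parametric latency model described above, the sketch below combines a roofline-style analytical estimate (latency as the maximum of compute time and memory time) with per-operation efficiency factors that discount peak hardware specifications. The function name, the hardware numbers, and the specific efficiency values are illustrative assumptions, not the thesis's actual formulation or fitted parameters.

```python
# Illustrative roofline-style latency model with efficiency factors.
# All names and numbers here are hypothetical; the thesis's models may differ.

def op_latency(flops, bytes_moved, peak_flops, peak_bw,
               compute_eff=1.0, mem_eff=1.0):
    """Estimate operation latency (seconds) as the max of compute time and
    memory time, with efficiency factors discounting the hardware peaks."""
    t_compute = flops / (peak_flops * compute_eff)
    t_memory = bytes_moved / (peak_bw * mem_eff)
    return max(t_compute, t_memory)

# Hypothetical H100-like peaks: FP16 tensor FLOP/s and HBM bytes/s.
PEAK_FLOPS = 989e12
PEAK_BW = 3.35e12

# A linear projection: GEMM of (batch x d_model) by (d_model x d_out),
# with FP16 (2-byte) operands for inputs, weights, and outputs.
batch, d_model, d_out = 256, 5376, 5376
flops = 2 * batch * d_model * d_out  # one multiply + one add per MAC
bytes_moved = 2 * (batch * d_model + d_model * d_out + batch * d_out)

# Arithmetic intensity vs. the ridge point tells us which resource binds.
intensity = flops / bytes_moved
ridge = PEAK_FLOPS / PEAK_BW  # intensity where compute and memory balance

latency = op_latency(flops, bytes_moved, PEAK_FLOPS, PEAK_BW,
                     compute_eff=0.94, mem_eff=0.6)  # fitted-style factors
print(f"intensity={intensity:.1f} FLOP/byte, ridge={ridge:.1f}, "
      f"latency={latency * 1e6:.1f} us")
```

Under these assumed numbers the projection's arithmetic intensity falls below the ridge point, so the memory term dominates; lowering the memory-efficiency factor lengthens the predicted latency accordingly, which is the kind of underutilization effect the fitted parameters are meant to capture.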

This work provides the analytical foundation for operation-level performance characterization in heterogeneous LLM serving, enabling systematic reasoning about disaggregation beyond fixed partitioning heuristics.

52 pages

Thesis Committee:
Rashmi K. Vinayak (Chair)
Zhihao Jia

Jignesh Patel, Interim Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

