Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures

Abstract

Large language model (LLM)-based inference workloads increasingly dominate data center costs and resource utilization. Therefore, understanding the inference workload characteristics on evolving CPU-GPU coupled architectures is crucial for optimization. This paper presents an in-depth analysis of LLM inference behavior on loosely-coupled (PCIe A100/H100) and closely-coupled (GH200) systems. We analyze performance dynamics using fine-grained operator-to-kernel trace analysis, facilitated by our novel profiler SKIP and metrics like Total Kernel Launch and Queuing Time (TKLQT). Results show that closely-coupled (CC) GH200 significantly outperforms loosely-coupled (LC) systems at large batch sizes, achieving 1.9x-2.7x faster prefill latency for Llama 3.2-1B. However, our analysis also reveals that GH200 remains CPU-bound up to 4x larger batch sizes than LC systems. In this extended CPU-bound region, we identify the performance characteristics of the Grace CPU as a key factor contributing to higher inference latency at low batch sizes on GH200. We demonstrate that TKLQT accurately identifies this CPU/GPU-bound transition point. Based on this analysis, we further show that kernel fusion offers significant potential to mitigate GH200's low-batch latency bottleneck by reducing kernel launch overhead. This detailed kernel-level characterization provides critical insights for optimizing diverse CPU-GPU coupling strategies. This work is an initial effort, and we plan to explore other major AI/DL workloads that demand different degrees of CPU-GPU heterogeneous architectures.
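The CPU-bound to GPU-bound transition described in the abstract can be illustrated with a toy latency model. This is only a sketch under loudly stated assumptions: it is not the paper's SKIP profiler or its TKLQT definition, and the `ToySystem` class, its fields, and all numbers are hypothetical. It assumes CPU-side kernel launch cost grows with the number of kernels but not with batch size, GPU compute time grows linearly with batch size, and the two overlap so that latency is roughly the larger of the two.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToySystem:
    """Hypothetical system parameters; none of these values come from the paper."""
    launch_overhead_us: float  # assumed CPU cost to launch one kernel
    gpu_us_per_sample: float   # assumed GPU compute time per batch element


def cpu_launch_time_us(system: ToySystem, n_kernels: int) -> float:
    # Launch cost scales with kernel count, not with batch size (assumption).
    return n_kernels * system.launch_overhead_us


def gpu_compute_time_us(system: ToySystem, batch_size: int) -> float:
    # GPU work scales linearly with batch size (assumption).
    return batch_size * system.gpu_us_per_sample


def latency_us(system: ToySystem, n_kernels: int, batch_size: int) -> float:
    # With launches overlapping execution, the slower side sets the latency.
    return max(cpu_launch_time_us(system, n_kernels),
               gpu_compute_time_us(system, batch_size))


def is_cpu_bound(system: ToySystem, n_kernels: int, batch_size: int) -> bool:
    return (cpu_launch_time_us(system, n_kernels)
            >= gpu_compute_time_us(system, batch_size))


def transition_batch_size(system: ToySystem, n_kernels: int,
                          max_batch: int = 4096) -> Optional[int]:
    """Smallest power-of-two batch size at which the toy system is GPU-bound."""
    batch = 1
    while batch <= max_batch:
        if not is_cpu_bound(system, n_kernels, batch):
            return batch
        batch *= 2
    return None


toy = ToySystem(launch_overhead_us=10.0, gpu_us_per_sample=100.0)
unfused_kernels = 400  # hypothetical kernel count per forward pass
fused_kernels = 100    # hypothetical count after kernel fusion

print(transition_batch_size(toy, unfused_kernels))
print(transition_batch_size(toy, fused_kernels))
print(latency_us(toy, unfused_kernels, 1), latency_us(toy, fused_kernels, 1))
```

In this toy model, launching fewer kernels lowers the batch-size-independent CPU cost, which both reduces latency at batch size 1 and moves the transition to a smaller batch size. That is the qualitative effect the abstract attributes to kernel fusion; the real magnitudes depend on measurements this sketch does not attempt to reproduce.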

Authors: Prabhu Vellaisamy, Thomas Labonte, Sourav Chakraborty, Matt Turner, Samantika Sury, John Paul Shen

Subject classification: Computing technology, computer technology

Recommended citation: Prabhu Vellaisamy, Thomas Labonte, Sourav Chakraborty, Matt Turner, Samantika Sury, John Paul Shen. Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures [EB/OL]. (2025-04-16) [2025-04-29]. https://arxiv.org/abs/2504.11750.
