
HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing

Source: arXiv
Abstract

The attention layer, a core component of Transformer-based LLMs, exposes inefficiencies in current GPU systems due to its low operational intensity and the substantial memory requirements of KV caches. We propose a High-bandwidth Processing Unit (HPU), a memory-intensive co-processor that enhances GPU resource utilization during large-batch LLM inference. By offloading memory-bound operations, the HPU allows the GPU to focus on compute-intensive tasks, increasing overall efficiency. Moreover, as an add-on card, the HPU scales out to accommodate surging memory demands driven by large batch sizes and extended sequence lengths. In this paper, we present an HPU prototype implemented with PCIe-based FPGA cards mounted on a GPU system. Our novel GPU-HPU heterogeneous system demonstrates up to 4.1x performance gains and 4.6x energy efficiency improvements over a GPU-only system, providing scalability without increasing the number of GPUs.
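
The offloading split described above hinges on operational intensity (FLOPs per byte of memory traffic). The following minimal roofline-style sketch, not from the paper and using assumed values for hidden size, sequence length, and dtype width, estimates why decode-time attention over the KV cache stays memory-bound regardless of batch size, while the FFN GEMM becomes compute-bound as the batch grows, which is the imbalance a memory-intensive co-processor can exploit:

```python
# Roofline-style back-of-the-envelope estimate (illustrative sketch only;
# d, seq, and BYTES are assumed values, not figures from the paper).
# Activation traffic is ignored for simplicity.

BYTES = 2      # fp16 element size
d     = 4096   # hidden size (assumed)
seq   = 8192   # cached sequence length per request (assumed)

def attention_decode_intensity() -> float:
    """FLOPs per byte for one decode step of attention over the KV cache.

    q @ K^T and scores @ V each cost ~2*d*seq FLOPs, but the whole K and
    V caches (2 * seq * d * BYTES bytes) must be streamed from memory,
    so intensity stays ~1 FLOP/B no matter how large the batch is.
    """
    flops = 2 * (2 * d * seq)
    bytes_moved = 2 * seq * d * BYTES
    return flops / bytes_moved

def ffn_gemm_intensity(batch: int) -> float:
    """FLOPs per byte for the first FFN GEMM (d -> 4d) over a batch.

    The weight matrix is read once and reused across the batch, so
    intensity scales linearly with batch size.
    """
    flops = 2 * batch * d * (4 * d)
    bytes_moved = d * (4 * d) * BYTES
    return flops / bytes_moved

if __name__ == "__main__":
    print(f"attention decode : {attention_decode_intensity():.1f} FLOP/B")
    for b in (1, 32, 256):
        print(f"FFN GEMM, batch {b:>3}: {ffn_gemm_intensity(b):.1f} FLOP/B")
```

With these assumed parameters the script reports ~1 FLOP/B for decode attention at any batch size versus ~B FLOP/B for the batched FFN GEMM, which is consistent with the abstract's rationale for routing memory-bound attention to the HPU while keeping compute-bound layers on the GPU.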

Myunghyun Rhee, Joonseop Sim, Taeyoung Ahn, Seungyong Lee, Daegun Yoon, Euiseok Kim, Kyoung Park, Youngpyo Joo, Hosik Kim

Subjects: Computing Technology, Computer Technology

Myunghyun Rhee, Joonseop Sim, Taeyoung Ahn, Seungyong Lee, Daegun Yoon, Euiseok Kim, Kyoung Park, Youngpyo Joo, Hosik Kim. HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing [EB/OL]. (2025-04-17) [2025-05-10]. https://arxiv.org/abs/2504.16112.