
An Inquiry into Datacenter TCO for LLM Inference with FP8

Source: arXiv
Abstract

As large language models (LLMs) continue to scale, the high power consumption of AI accelerators in datacenters presents significant challenges, substantially increasing the total cost of ownership (TCO) for cloud service providers (CSPs) that provide LLM inference. In this work, we analyze the computational characteristics of LLM inference from a TCO perspective and present a generalizable framework to compare AI accelerators across diverse operational requirements. Using this model, we investigate key workload characteristics influencing TCO for AI accelerators from Intel (Gaudi 2 & 3) and NVIDIA (H100 & H200), especially thin GEMM utilization and FP8 quantization. In particular, as FP8 emerges as the baseline precision for next-generation LLMs, understanding how different architectures implement and benefit from low-precision computation is increasingly critical. Throughput on thin GEMMs has a greater impact on TCO than theoretical hardware peak throughput because the memory-bound decode phase is dominated by GEMV-like computations. We find that Gaudi HPUs achieve superior utilization on thin GEMMs compared to their counterparts, especially in FP8-quantized models. Our result underscores the importance of empirical, workload-level analysis in evaluating accelerator performance, rather than relying solely on theoretical hardware specifications. By studying the interaction between power consumption, quantization strategies, and hardware architecture, we provide insights to support informed deployment decisions and guide future accelerator designs aimed at improving the TCO of LLM inference workloads.
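
The abstract's central observation, that the decode phase is dominated by GEMV-like thin GEMMs and is therefore memory-bound, follows from a simple roofline argument: per decode step each weight byte is read once but used for only about two FLOPs per request in the batch. The sketch below illustrates this with hypothetical accelerator numbers (the FP8 peak throughput, HBM bandwidth, and layer dimensions are illustrative assumptions, not figures from the paper).

```python
# Illustrative roofline sketch (assumed numbers, not from the paper):
# why decode-phase thin GEMMs are memory-bound at small batch sizes.

def arithmetic_intensity(batch, k, n, bytes_per_weight=1.0):
    """FLOPs per byte of weight traffic for a (batch x k) @ (k x n) GEMM.
    At FP8 (1 byte/weight) this reduces to roughly 2 * batch FLOPs/byte."""
    flops = 2 * batch * k * n
    weight_bytes = k * n * bytes_per_weight
    return flops / weight_bytes

# Hypothetical accelerator characteristics for illustration only.
peak_fp8_tflops = 1000.0                       # dense FP8 peak, TFLOP/s
mem_bw_tbps = 3.0                              # HBM bandwidth, TB/s
roofline_knee = peak_fp8_tflops / mem_bw_tbps  # FLOPs/byte needed to be compute-bound

for batch in (1, 8, 64, 512):
    ai = arithmetic_intensity(batch, k=8192, n=8192)
    bound = "compute-bound" if ai >= roofline_knee else "memory-bound"
    print(f"batch={batch:4d}  AI={ai:7.1f} FLOPs/byte  ({bound}, knee={roofline_knee:.0f})")
```

With these assumed numbers, only batches of several hundred requests cross the roofline knee; typical decode batches sit well below it, so delivered bandwidth and thin-GEMM utilization, rather than peak TFLOP/s, govern throughput and hence TCO.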

Jiwoo Kim, Joonhyung Lee, Gunho Park, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee, Youngjoo Lee

Subject: computing technology, computer technology

Jiwoo Kim, Joonhyung Lee, Gunho Park, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee, Youngjoo Lee. An Inquiry into Datacenter TCO for LLM Inference with FP8 [EB/OL]. (2025-08-25) [2025-09-06]. https://arxiv.org/abs/2502.01070.
