Beyond the Buzz: A Pragmatic Take on Inference Disaggregation

Source: arXiv
Abstract

As inference scales to multi-node deployments, disaggregation - splitting inference into distinct phases - offers a promising path to improving the throughput-interactivity Pareto frontier. Despite growing enthusiasm and a surge of open-source efforts, practical deployment of disaggregated serving remains limited due to the complexity of the optimization search space and system-level coordination. In this paper, we present the first systematic study of disaggregated inference at scale, evaluating hundreds of thousands of design points across diverse workloads and hardware configurations. We find that disaggregation is most effective for prefill-heavy traffic patterns and larger models. Our results highlight the critical role of dynamic rate matching and elastic scaling in achieving Pareto-optimal performance. Our findings offer actionable insights for efficient disaggregated deployments to navigate the trade-off between system throughput and interactivity.

Ritika Borkar, Shivam Raj, Nidhi Bhatia, Ramon Matas, Dheevatsa Mudigere, Ritchie Zhao, Maximilian Golub, Arpan Dutta, Sailaja Madduri, Brian Pharris, Dharmesh Jani, Bita Darvish Rouhani, Tiyasa Mitra

Computing technology, computer technology

Ritika Borkar, Shivam Raj, Nidhi Bhatia, Ramon Matas, Dheevatsa Mudigere, Ritchie Zhao, Maximilian Golub, Arpan Dutta, Sailaja Madduri, Brian Pharris, Dharmesh Jani, Bita Darvish Rouhani, Tiyasa Mitra. Beyond the Buzz: A Pragmatic Take on Inference Disaggregation [EB/OL]. (2025-06-05) [2025-06-21]. https://arxiv.org/abs/2506.05508.
