Efficient LLM Serving on Hybrid Real-time and Best-effort Requests

Abstract

Recent breakthroughs in large language models (LLMs) have enabled various generative tasks on a single model. Real-world services (e.g., OpenAI's ChatGPT [27]) powered by an LLM often concurrently support latency-critical requests for interactive applications (e.g., question-answering systems, referred to as real-time or RT requests) and throughput-oriented requests for back-of-house processing (e.g., batch document processing [28], referred to as best-effort or BE requests), imposing complex hybrid inference workloads on the underlying model. State-of-the-art (SOTA) LLM serving systems dedicate machines to each type of request, optimizing for either low inference latency or high serving throughput. This practice simplifies request scheduling and management but suffers from poor resource utilization. We propose BROS, a hybrid LLM serving system that aims to collocate RT/BE requests, meeting RT requests' latency requirements while maintaining BE requests' throughput. BROS formulates the problem of hybrid RT/BE request scheduling and solves it with a dynamic priority-based algorithm. BROS designs a bidirectional KV cache management mechanism that allows RT requests to share KV memory with BE requests, removing the scheduling restrictions caused by insufficient KV memory and improving utilization. Extensive experiments validate that BROS achieves a good trade-off when serving hybrid RT and BE requests. It significantly reduces the latency of RT requests (by up to 74.20%) and improves their fine-grained service level objective (SLO) attainment (by up to 36.38x), with negligible throughput reduction for BE requests, showing significant advantages over SOTA systems such as vLLM and TGI.

Authors: Wan Borui, Zhao Juntao, Jiang Chenyu, Guo Chuanxiong, Wu Chuan

Subject classification: computing technology, computer technology

Recommended citation: Wan Borui, Zhao Juntao, Jiang Chenyu, Guo Chuanxiong, Wu Chuan. Efficient LLM Serving on Hybrid Real-time and Best-effort Requests [EB/OL]. (2025-04-13) [2025-04-26]. https://arxiv.org/abs/2504.09590.
