Towards Pareto Optimal Throughput in Small Language Model Serving
Large language models (LLMs) have revolutionized the state of the art across many natural language processing tasks. Although serving LLMs is computationally and memory intensive, the rise of Small Language Models (SLMs) offers new opportunities for resource-constrained users, who can now serve small models with cutting-edge performance. In this paper, we present a set of experiments designed to benchmark SLM inference in terms of both performance and energy consumption. Our analysis offers a new perspective on serving, highlighting that the small memory footprint of SLMs allows the Pareto-optimal throughput to be reached within the resource capacity of a single accelerator. In this regard, we present an initial set of findings demonstrating how model replication can effectively improve resource utilization when serving SLMs.
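The abstract does not come with code, but the replication idea it describes can be illustrated with a minimal, self-contained Python sketch. The sketch below is an assumption-laden toy, not the paper's implementation: the names TinyLM, worker, and serve are hypothetical, the model is a stand-in for a real SLM, and it assumes PyTorch with a CUDA-capable device. It spawns several weight copies of a small model on one GPU, each driven on its own CUDA stream, and round-robins requests across them through a shared queue.

```python
# Hypothetical sketch of single-GPU model replication for SLM serving.
# Assumes PyTorch and an available CUDA device; not the paper's code.
import threading
import queue
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy stand-in for a small language model (the paper benchmarks real SLMs)."""
    def __init__(self, vocab=32000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim, vocab)

    def forward(self, ids):
        # Mean-pool token embeddings, then project back to the vocabulary.
        return self.proj(self.embed(ids).mean(dim=1))

def worker(replica, stream, requests, results):
    """Each replica drains the shared request queue on its own CUDA stream,
    so kernels from independent replicas can overlap on the device."""
    while True:
        item = requests.get()
        if item is None:  # sentinel: shut this replica down
            break
        req_id, ids = item
        with torch.cuda.stream(stream), torch.no_grad():
            out = replica(ids.to("cuda", non_blocking=True))
        stream.synchronize()
        results.put((req_id, out.argmax(-1).cpu()))

def serve(num_replicas=4, num_requests=32):
    base = TinyLM().eval().cuda()
    requests, results = queue.Queue(), queue.Queue()
    threads = []
    for _ in range(num_replicas):
        # Each replica is an independent weight copy in GPU memory; this is
        # feasible only because an SLM occupies a fraction of device memory.
        replica = TinyLM().eval().cuda()
        replica.load_state_dict(base.state_dict())
        t = threading.Thread(
            target=worker,
            args=(replica, torch.cuda.Stream(), requests, results))
        t.start()
        threads.append(t)
    for i in range(num_requests):
        requests.put((i, torch.randint(0, 32000, (1, 16))))
    for _ in threads:
        requests.put(None)
    for t in threads:
        t.join()
    return [results.get() for _ in range(num_requests)]

if __name__ == "__main__":
    print(len(serve()), "requests served")
```

In a realistic deployment each replica would be a full inference engine instance behind a request router rather than a thread in one process; the sketch is only meant to show why a small per-replica memory footprint makes N-way replication on a single accelerator possible at all.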
Alaa Youssef, Chen Wang, Pol G. Recasens, Eun Kyung Lee, Yue Zhu, Olivier Tardieu, Jordi Torres, Josep Ll. Berral
Computing Technology, Computer Technology
Alaa Youssef, Chen Wang, Pol G. Recasens, Eun Kyung Lee, Yue Zhu, Olivier Tardieu, Jordi Torres, Josep Ll. Berral. Towards Pareto Optimal Throughput in Small Language Model Serving [EB/OL]. (2025-07-12) [2025-07-25]. https://arxiv.org/abs/2404.03353.