Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI
Inference is now the dominant AI workload, yet existing systems force trade-offs between latency, throughput, and cost. Arctic Inference, an open-source vLLM plugin from Snowflake AI Research, introduces Shift Parallelism, a dynamic parallelism strategy that adapts to real-world traffic while integrating speculative decoding, SwiftKV compute reduction, and optimized embedding inference. It achieves up to 3.4 times faster request completion, 1.75 times faster generation, and 1.6M tokens/sec per GPU for embeddings, outperforming both latency- and throughput-optimized deployments. Already powering Snowflake Cortex AI, Arctic Inference delivers state-of-the-art, cost-effective inference for enterprise AI and is now available to the community.
Samyam Rajbhandari, Mert Hidayetoglu, Aurick Qiao, Ye Wang, Juncheng Yang, Jeff Rasley, Michael Wyatt, Yuxiong He
Computing Technology, Computer Technology
Samyam Rajbhandari, Mert Hidayetoglu, Aurick Qiao, Ye Wang, Juncheng Yang, Jeff Rasley, Michael Wyatt, Yuxiong He. Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI [EB/OL]. (2025-07-16) [2025-08-05]. https://arxiv.org/abs/2507.11830.