
RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding

Source: arXiv
Abstract

The emergence of long-context large language models (LLMs) offers a promising alternative to traditional retrieval-augmented generation (RAG) for processing extensive documents. However, the computational overhead of long-context inference presents significant efficiency challenges. While Speculative Decoding (SD) traditionally accelerates inference using smaller draft models, its effectiveness diminishes substantially in long-context scenarios due to memory-bound KV cache operations. We introduce Retrieval-Augmented Speculative Decoding (RAPID), which leverages RAG to both accelerate and enhance generation quality in long-context inference. RAPID introduces the RAG drafter, a draft LLM operating on shortened retrieval contexts, to speculate on the generation of long-context target LLMs. Our approach enables a new paradigm where same-scale or even larger LLMs can serve as RAG drafters while maintaining computational efficiency. To fully leverage the potentially superior capabilities of stronger RAG drafters, we develop an inference-time knowledge transfer that enriches the target distribution via RAG. Extensive experiments on the LLaMA-3.1 and Qwen2.5 backbones demonstrate that RAPID effectively integrates the strengths of both RAG and long-context LLMs, achieving significant performance improvements (e.g., from 39.33 to 42.83 on InfiniteBench for LLaMA-3.1-8B) with more than 2x speedups for long-context inference. Our analyses also reveal the robustness of RAPID across various context lengths and retrieval quality.
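
To make the speculation-verification loop concrete, below is a minimal Python sketch of the scheme the abstract describes: a RAG drafter proposes tokens conditioned only on shortened retrieved passages, while the long-context target LLM verifies them with standard speculative-sampling acceptance. The `target_lm` and `drafter_lm` callables and the `lookahead` argument are hypothetical placeholders rather than the paper's API, and the paper's inference-time knowledge-transfer step is omitted; this illustrates the general scheme, not the authors' implementation.

```python
import torch

def rapid_sketch(target_lm, drafter_lm, long_ctx_ids, rag_ctx_ids,
                 gamma=4, max_new_tokens=128):
    """Hypothetical sketch of retrieval-augmented speculative decoding.

    Assumed interfaces (not the paper's):
      drafter_lm(ids)              -> next-token distribution, shape [V]
      target_lm(ids, lookahead=L)  -> list of len(L)+1 distributions, one per
                                      draft position plus one bonus position
    """
    generated = []
    while len(generated) < max_new_tokens:
        out = torch.tensor(generated, dtype=torch.long)

        # 1) The RAG drafter speculates gamma tokens from the SHORT retrieved
        #    context, so its KV cache never grows with the full document.
        draft_ids, draft_probs = [], []
        draft_input = torch.cat([rag_ctx_ids, out])
        for _ in range(gamma):
            p = drafter_lm(draft_input)
            t = torch.multinomial(p, 1)
            draft_ids.append(int(t))
            draft_probs.append(p)
            draft_input = torch.cat([draft_input, t])

        # 2) The target scores the whole draft in one long-context pass.
        q_dists = target_lm(torch.cat([long_ctx_ids, out]), lookahead=draft_ids)

        # 3) Standard speculative acceptance: keep token t w.p. min(1, q(t)/p(t));
        #    on rejection, resample from the residual target distribution.
        for i, t in enumerate(draft_ids):
            q, p = q_dists[i][t], draft_probs[i][t]
            if torch.rand(()) <= torch.clamp(q / p, max=1.0):
                generated.append(t)
            else:
                resid = torch.clamp(q_dists[i] - draft_probs[i], min=0.0)
                generated.append(int(torch.multinomial(resid / resid.sum(), 1)))
                break
        else:
            # All gamma draft tokens accepted: take one bonus token for free.
            generated.append(int(torch.multinomial(q_dists[gamma], 1)))
    return generated
```

Because the drafter attends only to the retrieved passages, its per-token cost stays small even when the target's context is very long, which is what lets same-scale or even larger models serve as drafters without erasing the speedup.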

Guanzheng Chen, Qilong Feng, Jinjie Ni, Xin Li, Michael Qizhe Shieh

Computing Technology, Computer Technology

Guanzheng Chen, Qilong Feng, Jinjie Ni, Xin Li, Michael Qizhe Shieh. RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding [EB/OL]. (2025-06-23) [2025-08-02]. https://arxiv.org/abs/2502.20330
