FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling

Source: arXiv
Abstract

Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12$\times$ speedup over the state-of-the-art speculative sampling method EAGLE-2. Code available at https://github.com/thunlp/FR-Spec.
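The vocabulary-compression idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that); it only shows what "constraining the draft search to a frequency-prioritized token subset" could look like in principle. The function names `build_frequent_subset`, `restricted_lm_head`, and `draft_token` are hypothetical, the LM head is modeled as a plain list of weight rows, and token frequencies are assumed to come from counting a sample corpus. Verification by the full-vocabulary target model, which is what preserves the output distribution, is deliberately left out.

```python
from collections import Counter


def build_frequent_subset(corpus_token_ids, keep):
    """Return the `keep` most frequent token ids, most frequent first.

    Assumption: frequencies are estimated by counting a sample corpus.
    """
    counts = Counter(corpus_token_ids)
    # Sort by descending count, then ascending id so ties are deterministic.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return ranked[:keep]


def restricted_lm_head(hidden, lm_head_rows, subset):
    """Compute draft logits only for the token ids in `subset`.

    `lm_head_rows[t]` is the weight vector for token `t`; rows outside
    `subset` are never touched, which is where the saving comes from.
    """
    return {
        t: sum(h * w for h, w in zip(hidden, lm_head_rows[t]))
        for t in subset
    }


def draft_token(hidden, lm_head_rows, subset):
    """Greedy draft proposal over the restricted vocabulary."""
    logits = restricted_lm_head(hidden, lm_head_rows, subset)
    return max(logits, key=logits.get)


if __name__ == "__main__":
    # Toy setup: 8-token vocabulary, keep 2 tokens (a quarter of the rows).
    corpus = [3, 3, 3, 1, 1, 5, 7, 3, 1, 5]
    subset = build_frequent_subset(corpus, keep=2)
    rows = [[0.1 * t, 0.0] for t in range(8)]
    print(subset, draft_token([1.0, 0.0], rows, subset))
```

In this toy setup the draft only evaluates a quarter of the LM head rows, an illustrative ratio chosen to mirror the 75% reduction quoted above. The draft may propose a token that differs from the full-vocabulary argmax; in speculative sampling the target model's verification step is what accepts or rejects such proposals.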

Kaihuo Zhang, Maosong Sun, Ao Sun, Weilin Zhao, Yuxuan Li, Jianyong Wang, Tengyu Pan, Yuxiang Huang, Zhiyuan Liu, Xu Han, Yudi Zhang, Weilun Zhao

Subject: Linguistics

Kaihuo Zhang, Maosong Sun, Ao Sun, Weilin Zhao, Yuxuan Li, Jianyong Wang, Tengyu Pan, Yuxiang Huang, Zhiyuan Liu, Xu Han, Yudi Zhang, Weilun Zhao. FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling [EB/OL]. (2025-02-20) [2025-05-08]. https://arxiv.org/abs/2502.14856.
