
Mamba Drafters for Speculative Decoding

Source: arXiv
Abstract

Speculative decoding has emerged as a promising approach to accelerating large language model (LLM) generation using a fast drafter while maintaining alignment with the target model's distribution. However, existing approaches face a trade-off: external drafters offer flexibility but can suffer from slower drafting, while self-speculation methods use drafters tailored to the target model but require re-training. In this paper, we introduce novel drafters based on Mamba, a state-of-the-art state space model (SSM), as a solution that combines the best aspects of both approaches. By leveraging the linear structure of SSMs, our approach avoids the quadratic complexity inherent in traditional Transformer-based methods, enabling faster drafting and lower memory usage while maintaining the flexibility to work across different target models. We further enhance efficiency with a novel test-time tree search algorithm for generating high-quality draft candidates. Our empirical evaluation demonstrates that Mamba-based drafters not only outperform existing external drafting methods but are also comparable to state-of-the-art self-speculation approaches while using less memory and maintaining their cross-model adaptability.
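
For readers unfamiliar with the draft-and-verify loop the abstract builds on, the sketch below illustrates the standard speculative decoding verification step (rejection sampling against the target model's distribution, in the style of Leviathan et al., 2023). This is a generic illustration, not the paper's implementation: the function name, array shapes, and NumPy realization are assumptions made for clarity.

import numpy as np

def verify_draft(draft_tokens, draft_probs, target_probs, rng=None):
    """Standard speculative-decoding verification step (illustrative sketch).

    draft_tokens: k token ids proposed by the drafter.
    draft_probs:  (k, V) drafter distributions at each draft position.
    target_probs: (k+1, V) target distributions; the extra row supplies a
                  "bonus" token when every draft token is accepted.
    Returns an accepted prefix plus one token, distributed exactly as if
    it had been sampled from the target model alone.
    """
    if rng is None:
        rng = np.random.default_rng()
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i, tok], draft_probs[i, tok]
        if rng.random() < min(1.0, p / q):   # accept with probability min(1, p/q)
            accepted.append(tok)
        else:
            # Rejection: resample from the normalized residual max(0, p - q),
            # which restores the exact target distribution at this position.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            accepted.append(int(rng.choice(residual.size, p=residual / residual.sum())))
            return accepted
    # All k drafts accepted: draw one extra token from the target for free.
    accepted.append(int(rng.choice(target_probs.shape[1], p=target_probs[-1])))
    return accepted

The drafter's only job is to produce draft_tokens and draft_probs quickly; the verification step above keeps the output faithful to the target model regardless of the drafter used. The paper's claim is that a Mamba drafter fills this role with constant per-token state, rather than the growing key-value cache that makes Transformer-based drafting quadratic in sequence length.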

Daewon Choi, Seunghyuk Oh, Saket Dingliwal, Jihoon Tack, Kyuyoung Kim, Woomin Song, Seojin Kim, Insu Han, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati

Subjects: Computing Technology; Computer Technology

Daewon Choi, Seunghyuk Oh, Saket Dingliwal, Jihoon Tack, Kyuyoung Kim, Woomin Song, Seojin Kim, Insu Han, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati. Mamba Drafters for Speculative Decoding [EB/OL]. (2025-06-01) [2025-06-29]. https://arxiv.org/abs/2506.01206.
