SeerAttention-R: Sparse Attention Adaptation for Long Reasoning
We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With a lightweight plug-in gate, SeerAttention-R is flexible and can be easily integrated into existing pretrained models without modifying the original parameters. We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy with a 4K token budget on the AIME benchmark under large sparse attention block sizes (64/128). Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on an H100 GPU at 90% sparsity. Code is available at: https://github.com/microsoft/SeerAttention.
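To make the gating idea concrete, below is a minimal, illustrative sketch of one block-sparse decoding step: a small gate scores mean-pooled KV blocks against the current query and attention is restricted to the top-scoring blocks within a token budget. This is an assumption-laden simplification for intuition only; the projection names (gate_q_proj, gate_k_proj), the mean-pooling choice, and the budget handling are hypothetical and do not reproduce the actual SeerAttention-R gate or its TileLang kernel.

```python
import torch
import torch.nn.functional as F

def block_sparse_decode_step(q, K, V, gate_q_proj, gate_k_proj,
                             block_size=64, budget_tokens=4096):
    """One auto-regressive decoding step with block-sparse attention.

    q:    (heads, head_dim)          current query token
    K, V: (heads, seq_len, head_dim) cached keys / values
    gate_q_proj, gate_k_proj: small linear layers mapping the query and the
        pooled key blocks into the gate's scoring space (hypothetical names).
    """
    heads, seq_len, head_dim = K.shape
    n_blocks = (seq_len + block_size - 1) // block_size
    budget_blocks = max(1, budget_tokens // block_size)

    # Mean-pool keys within each block (pad the tail block if needed).
    pad = n_blocks * block_size - seq_len
    K_pad = F.pad(K, (0, 0, 0, pad))
    K_blocks = K_pad.view(heads, n_blocks, block_size, head_dim).mean(dim=2)

    # Gate: score each KV block against the current query.
    gq = gate_q_proj(q)                        # (heads, gate_dim)
    gk = gate_k_proj(K_blocks)                 # (heads, n_blocks, gate_dim)
    scores = torch.einsum("hd,hbd->hb", gq, gk)

    # Keep only the top-scoring blocks within the token budget.
    k = min(budget_blocks, n_blocks)
    top_idx = scores.topk(k, dim=-1).indices   # (heads, k)

    # Token-level mask from the selected blocks, then masked attention.
    block_ids = torch.arange(seq_len, device=K.device) // block_size
    keep = (block_ids[None, None, :] == top_idx[:, :, None]).any(dim=1)

    attn = torch.einsum("hd,hsd->hs", q, K) / head_dim ** 0.5
    attn = attn.masked_fill(~keep, float("-inf"))
    return torch.einsum("hs,hsd->hd", attn.softmax(dim=-1), V)
```

In such a setup the gate adds only a small per-step cost (one score per KV block rather than per token), which is what allows the attention kernel to skip unselected blocks entirely at high sparsity.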
Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang
Computing Technology, Computer Technology
Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang. SeerAttention-R: Sparse Attention Adaptation for Long Reasoning [EB/OL]. (2025-06-10) [2025-06-29]. https://arxiv.org/abs/2506.08889.