
FlashBias: Fast Computation of Attention with Bias

Source: arXiv

Abstract

The attention mechanism has emerged as a foundational module of modern deep learning models and has empowered many milestones across domains. Moreover, FlashAttention, with its IO-aware speedup, resolves the efficiency issue of standard attention, further promoting its practicality. Beyond canonical attention, attention with bias is also widely used, such as relative position bias in vision and language models and pair representation bias in AlphaFold. In these works, prior knowledge is introduced as an additive bias term on the attention weights to guide the learning process, which has been proven essential for model performance. Surprisingly, despite the widespread use of attention with bias, targeted efficiency optimization for it is still absent, which seriously hinders its application to complex tasks. Diving into the computation of FlashAttention, we prove that its optimal efficiency is determined by the rank of the attention weight matrix. Inspired by this theoretical result, this paper presents FlashBias, based on low-rank compressed sensing theory, which provides fast exact computation for many widely used attention biases and a fast, accurate approximation for biases in a general formulation. FlashBias fully exploits the highly optimized matrix multiplication operations in modern GPUs, achieving a 1.5$\times$ speedup for AlphaFold and over a 2$\times$ speedup for attention with bias in vision and language models, without loss of accuracy.
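
The abstract's core observation can be illustrated with a small sketch: if the additive bias matrix has low rank, B = U Vb^T, it can be folded into the query/key inner product by concatenating a few extra channels, so a standard (Flash)Attention kernel computes softmax(QK^T/√d + B)V without ever materializing the N×N bias. The sketch below is not the paper's implementation; the names U, Vb, and the rank r are illustrative assumptions, and dense matmuls stand in for the fused kernel.

```python
# Minimal sketch: an additive low-rank bias B = U @ Vb.T folded into Q/K,
# so biased attention reduces to ordinary attention on augmented inputs.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def biased_attention_naive(Q, K, V, B):
    """Reference: softmax(Q K^T / sqrt(d) + B) V with B fully materialized."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d) + B) @ V

def biased_attention_lowrank(Q, K, V, U, Vb):
    """Same result when B = U @ Vb.T, using only augmented Q and K.

    The augmented matrices [Q/sqrt(d), U] and [K, Vb] could be fed to a
    standard FlashAttention kernel (emulated here with dense matmuls),
    which is where the practical speedup would come from.
    """
    d = Q.shape[-1]
    Q_aug = np.concatenate([Q / np.sqrt(d), U], axis=-1)  # (N, d + r)
    K_aug = np.concatenate([K, Vb], axis=-1)              # (N, d + r)
    scores = Q_aug @ K_aug.T                              # = Q K^T / sqrt(d) + U Vb^T
    return softmax(scores) @ V

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, d, r = 128, 64, 4                        # sequence length, head dim, bias rank (assumed)
    Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
    U, Vb = rng.standard_normal((N, r)), rng.standard_normal((N, r))
    B = U @ Vb.T                                # low-rank additive bias

    out_ref = biased_attention_naive(Q, K, V, B)
    out_fast = biased_attention_lowrank(Q, K, V, U, Vb)
    print(np.allclose(out_ref, out_fast))       # True: both paths agree
```

In this toy setting the bias is exactly low-rank, so the two paths agree to floating-point precision; the paper's claim is that many practical biases (relative position, pair representations) admit such exact or approximate low-rank factorizations.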

Haixu Wu, Minghao Guo, Yuezhou Ma, Yuanxu Sun, Jianmin Wang, Wojciech Matusik, Mingsheng Long

Subject: Computing Technology, Computer Technology

Haixu Wu, Minghao Guo, Yuezhou Ma, Yuanxu Sun, Jianmin Wang, Wojciech Matusik, Mingsheng Long. FlashBias: Fast Computation of Attention with Bias[EB/OL]. (2025-05-17)[2025-06-25]. https://arxiv.org/abs/2505.12044.
