
VSA: Faster Video Diffusion with Trainable Sparse Attention

Source: arXiv
English Abstract

Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at \emph{both} training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight \emph{critical tokens}; a fine stage computes token-level attention only inside those tiles, subject to a block computing layout to ensure hardware efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85\% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPs by 2.53$\times$ with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention time by 6$\times$ and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models. Code will be available at https://github.com/hao-ai-lab/FastVideo.
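
The coarse-to-fine scheme the abstract describes can be sketched at a framework level. The PyTorch code below is only an illustration of the idea, not the paper's method: the actual VSA is a single fused, differentiable block-sparse kernel, and the tile size, top-k budget, and all names here (e.g. vsa_like_attention) are assumptions introduced for this sketch.

import torch


def vsa_like_attention(q, k, v, tile_size=64, topk=8):
    """Coarse-to-fine sparse attention sketch (assumed interface, not the VSA kernel).

    q, k, v: (batch, heads, seq_len, head_dim); seq_len must be divisible by tile_size.
    """
    B, H, N, D = q.shape
    T = tile_size
    n_tiles = N // T
    scale = D ** -0.5

    # Coarse stage: mean-pool tokens into tiles and score every tile pair.
    q_pool = q.view(B, H, n_tiles, T, D).mean(dim=3)                 # (B, H, n_tiles, D)
    k_pool = k.view(B, H, n_tiles, T, D).mean(dim=3)                 # (B, H, n_tiles, D)
    tile_scores = torch.einsum("bhqd,bhkd->bhqk", q_pool, k_pool) * scale
    topk = min(topk, n_tiles)
    # Indices of the highest-weight ("critical") key tiles for each query tile.
    critical = tile_scores.topk(topk, dim=-1).indices                # (B, H, n_tiles, topk)

    # Fine stage: token-level attention restricted to the selected key/value tiles.
    k_blocks = k.view(B, H, n_tiles, T, D)
    v_blocks = v.view(B, H, n_tiles, T, D)
    b_idx = torch.arange(B, device=q.device)[:, None, None, None]
    h_idx = torch.arange(H, device=q.device)[None, :, None, None]
    k_sel = k_blocks[b_idx, h_idx, critical].reshape(B, H, n_tiles, topk * T, D)
    v_sel = v_blocks[b_idx, h_idx, critical].reshape(B, H, n_tiles, topk * T, D)

    q_blocks = q.view(B, H, n_tiles, T, D)
    attn = torch.softmax(torch.einsum("bhnqd,bhnkd->bhnqk", q_blocks, k_sel) * scale, dim=-1)
    out = torch.einsum("bhnqk,bhnkd->bhnqd", attn, v_sel)
    return out.reshape(B, H, N, D)


# Toy usage: 4 heads over a 1,024-token sequence with 64-dim heads.
q = torch.randn(1, 4, 1024, 64)
out = vsa_like_attention(q, q, q)
print(out.shape)  # torch.Size([1, 4, 1024, 64])

In this sketch the fine stage materializes gathered key/value tiles for clarity; the hardware efficiency claimed in the abstract comes from keeping the selection block-aligned so that both stages can run inside one fused kernel.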

Peiyuan Zhang, Haofeng Huang, Yongqi Chen, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, Hao Zhang

Computing technology, computer technology

Peiyuan Zhang, Haofeng Huang, Yongqi Chen, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, Hao Zhang. VSA: Faster Video Diffusion with Trainable Sparse Attention [EB/OL]. (2025-05-19) [2025-06-23]. https://arxiv.org/abs/2505.13389.
