Flexible Operator Fusion for Fast Sparse Transformer with Diverse Masking on GPU
Large language models (LLMs) are widely used due to their powerful understanding capabilities. As the core component of LLMs, the Transformer has made accelerating inference through parallelization a hot research topic. Mask layers introduce sparsity into the Transformer to reduce computation. However, previous works rarely focus on the performance optimization of sparse Transformers. Moreover, rule-based mechanisms ignore the fusion opportunities of mixed-type operators and fail to adapt to various sequence lengths. To address the above problems, we propose STOF, a framework that optimizes Sparse Transformers via flexible masking and operator fusion on GPU. We first unify the storage format and kernel implementation for multi-head attention. Then, we map fusion schemes to compilation templates and determine the optimal parameter settings through a two-stage search engine. The experimental results show that, compared to the state-of-the-art work, STOF achieves maximum speedups of 1.7x in MHA computation and 1.5x in end-to-end inference.
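To illustrate the sparsity that mask layers introduce, the following is a minimal PyTorch-style sketch of masked multi-head attention with a sliding-window mask. It is not STOF's kernel implementation; the function names, the mask choice, and all parameters are illustrative assumptions.

```python
# Illustrative sketch only (not STOF's fused kernels): a boolean mask
# zeroes out attention scores, introducing sparsity into MHA.
import torch
import torch.nn.functional as F


def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask keeping only key positions within `window` of each query."""
    idx = torch.arange(seq_len)
    return (idx[:, None] - idx[None, :]).abs() <= window


def masked_attention(q, k, v, mask):
    """q, k, v: [batch, heads, seq_len, head_dim]; mask: [seq_len, seq_len] bool."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))  # sparsify the score matrix
    return F.softmax(scores, dim=-1) @ v


if __name__ == "__main__":
    b, h, s, d = 2, 4, 128, 64
    q, k, v = (torch.randn(b, h, s, d) for _ in range(3))
    out = masked_attention(q, k, v, sliding_window_mask(s, window=16))
    print(out.shape)  # torch.Size([2, 4, 128, 64])
```

In an unfused implementation each of these steps (score computation, masking, softmax, value aggregation) launches a separate kernel; STOF's contribution is to fuse such mixed-type operators and adapt the fused kernels to diverse masks and sequence lengths.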
Wenhao Dai, Haodong Deng, Mengfei Rong, Xinyu Yang, Hongyu Liu, Fangxin Liu, Hailong Yang, Weifeng Liu, Qingxiao Sun
Computing Technology, Computer Technology
Wenhao Dai, Haodong Deng, Mengfei Rong, Xinyu Yang, Hongyu Liu, Fangxin Liu, Hailong Yang, Weifeng Liu, Qingxiao Sun. Flexible Operator Fusion for Fast Sparse Transformer with Diverse Masking on GPU [EB/OL]. (2025-06-06) [2025-07-21]. https://arxiv.org/abs/2506.06095