MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion
The combination of Spiking Neural Networks (SNNs) with Vision Transformer architectures has attracted significant attention due to their great potential for energy-efficient and high-performance computing paradigms. However, a substantial performance gap still exists between SNN-based and ANN-based transformer architectures. While existing methods propose spiking self-attention mechanisms that are successfully combined with SNNs, the overall architectures built on them suffer from a bottleneck in effectively extracting features at different image scales. In this paper, we address this issue and propose MSVIT, a novel spike-driven Transformer architecture that is the first to use multi-scale spiking attention (MSSA) to enrich the capability of spiking attention blocks. We validate our approach on several mainstream datasets. The experimental results show that MSVIT outperforms existing SNN-based models, positioning itself as a state-of-the-art solution among SNN-transformer architectures. The code is available at https://github.com/Nanhu-AI-Lab/MSViT.
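The abstract does not describe the internals of MSSA, so the PyTorch sketch below is purely illustrative of what a multi-scale spiking attention block could look like: it assumes one depthwise convolution per scale, a hard-threshold spike function, and additive fusion of per-scale attention outputs. None of these choices are confirmed by the paper, and the names MultiScaleSpikingAttention and SpikeActivation are hypothetical.

```python
# Hypothetical sketch of a multi-scale spiking attention (MSSA) block.
# Every design choice here (branch kernel sizes, hard-threshold spiking,
# fusion by summation) is an assumption, not the authors' actual method.
import torch
import torch.nn as nn


class SpikeActivation(nn.Module):
    """Binary (0/1) activation approximating a spiking neuron's firing.
    Real SNN training would use a surrogate gradient instead."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (x > 0).float()


class MultiScaleSpikingAttention(nn.Module):
    """Toy MSSA: spiking attention computed over several spatial scales,
    then fused by summation."""
    def __init__(self, dim: int, scales=(1, 3, 5)):
        super().__init__()
        # One depthwise conv per scale extracts features at that receptive field.
        self.branches = nn.ModuleList([
            nn.Conv2d(dim, dim, kernel_size=k, padding=k // 2, groups=dim)
            for k in scales
        ])
        self.q_proj = nn.Conv2d(dim, dim, 1)
        self.k_proj = nn.Conv2d(dim, dim, 1)
        self.v_proj = nn.Conv2d(dim, dim, 1)
        self.spike = SpikeActivation()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        out = torch.zeros_like(x)
        for branch in self.branches:
            feat = branch(x)
            # Spiking Q/K/V: binary tensors keep the attention computation additive.
            q = self.spike(self.q_proj(feat)).flatten(2)   # (B, C, HW)
            k = self.spike(self.k_proj(feat)).flatten(2)   # (B, C, HW)
            v = self.spike(self.v_proj(feat)).flatten(2)   # (B, C, HW)
            attn = q @ k.transpose(1, 2) / (h * w)         # (B, C, C)
            out = out + (attn @ v).view(b, c, h, w)        # fuse scales by sum
        return out


if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 32)                   # (batch, channels, H, W)
    print(MultiScaleSpikingAttention(64)(x).shape)   # torch.Size([2, 64, 32, 32])
```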
Wei Hua, Chenlin Zhou, Jibin Wu, Yansong Chua, Yangyang Shu
Computing Technology; Computer Technology
Wei Hua, Chenlin Zhou, Jibin Wu, Yansong Chua, Yangyang Shu. MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion [EB/OL]. (2025-05-19) [2025-06-17]. https://arxiv.org/abs/2505.14719.