Dual Branch VideoMamba with Gated Class Token Fusion for Violence Detection
The rapid proliferation of surveillance cameras has increased the demand for automated violence detection. While CNNs and Transformers have shown success in extracting spatio-temporal features, they struggle with long-term dependencies and computational efficiency. We propose Dual Branch VideoMamba with Gated Class Token Fusion (GCTF), an efficient architecture that combines a dual-branch design with a state-space model (SSM) backbone. One branch captures spatial features while the other focuses on temporal dynamics, and the two are fused continuously through a gating mechanism. We also present a new benchmark for video violence detection, built by merging the RWF-2000, RLVS, and VioPeru datasets while ensuring strict separation between the training and testing sets. Our model achieves state-of-the-art performance on this benchmark, offering an optimal balance between accuracy and computational efficiency and demonstrating the promise of SSMs for scalable, real-time surveillance violence detection.
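The abstract does not give the exact formulation of the gating mechanism, so the following is only a minimal sketch of one common way to fuse two class tokens with a learned gate. It assumes an element-wise sigmoid gate computed from the concatenated spatial and temporal class tokens; the function name `gated_class_token_fusion` and the parameters `gate_weights` and `gate_bias` are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, mapping a real-valued logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_class_token_fusion(cls_spatial, cls_temporal, gate_weights, gate_bias):
    """Fuse two class tokens with an element-wise gate (illustrative only).

    ASSUMPTION: this gate form is a generic choice, not the paper's
    confirmed method.

    cls_spatial, cls_temporal: lists of D floats (one class token per branch).
    gate_weights: D rows, each with 2*D floats (hypothetical learned weights).
    gate_bias: list of D floats (hypothetical learned bias).
    """
    # The gate sees both tokens at once.
    concat = list(cls_spatial) + list(cls_temporal)
    fused = []
    for d, (s, t) in enumerate(zip(cls_spatial, cls_temporal)):
        logit = sum(w * x for w, x in zip(gate_weights[d], concat)) + gate_bias[d]
        g = sigmoid(logit)
        # g near 1 favors the spatial token, g near 0 the temporal token.
        fused.append(g * s + (1.0 - g) * t)
    return fused


# Usage: with zero weights and zero bias the gate is 0.5 everywhere,
# so the fused token is the plain average of the two branch tokens.
spatial = [1.0, 2.0]
temporal = [3.0, 6.0]
zero_w = [[0.0] * 4 for _ in range(2)]
fused_token = gated_class_token_fusion(spatial, temporal, zero_w, [0.0, 0.0])
```

In a setup like this the gate would be learned jointly with the backbone, and "continuous fusion" would presumably mean applying such a step repeatedly across layers rather than once at the end; that reading is also an assumption.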
Damith Chamalke Senadeera, Xiaoyun Yang, Dimitrios Kollias, Gregory Slabaugh
Automation technology, automation technology equipment; computing technology, computer technology
Damith Chamalke Senadeera, Xiaoyun Yang, Dimitrios Kollias, Gregory Slabaugh. Dual Branch VideoMamba with Gated Class Token Fusion for Violence Detection [EB/OL]. (2025-05-23) [2025-07-21]. https://arxiv.org/abs/2506.03162