RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization
RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization
Visual manipulation localization (VML) -- across both images and videos -- is a crucial task in digital forensics that involves identifying tampered regions in visual content. However, existing methods often lack cross-modal generalization and struggle to handle high-resolution or long-duration inputs efficiently. We propose RelayFormer, a unified and modular architecture for visual manipulation localization across images and videos. By leveraging flexible local units and a Global-Local Relay Attention (GLoRA) mechanism, it enables scalable, resolution-agnostic processing with strong generalization. Our framework integrates seamlessly with existing Transformer-based backbones, such as ViT and SegFormer, via lightweight adaptation modules that require only minimal architectural changes, ensuring compatibility without disrupting pretrained representations. Furthermore, we design a lightweight, query-based mask decoder that supports one-shot inference across video sequences with linear complexity. Extensive experiments across multiple benchmarks demonstrate that our approach achieves state-of-the-art localization performance, setting a new baseline for scalable and modality-agnostic VML. Code is available at: https://github.com/WenOOI/RelayFormer.
Wen Huang、Jiarui Yang、Tao Dai、Jiawei Li、Shaoxiong Zhan、Bin Wang、Shu-Tao Xia
计算技术、计算机技术
Wen Huang,Jiarui Yang,Tao Dai,Jiawei Li,Shaoxiong Zhan,Bin Wang,Shu-Tao Xia.RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization[EB/OL].(2025-08-13)[2025-08-24].https://arxiv.org/abs/2508.09459.点此复制
评论