
Video-Level Language-Driven Video-Based Visible-Infrared Person Re-Identification

Source: arXiv

Abstract

Video-based Visible-Infrared Person Re-Identification (VVI-ReID) aims to match pedestrian sequences across modalities by extracting modality-invariant sequence-level features. As a high-level semantic representation, language provides a consistent description of pedestrian characteristics in both infrared and visible modalities. Leveraging the Contrastive Language-Image Pre-training (CLIP) model to generate video-level language prompts and guide the learning of modality-invariant sequence-level features is theoretically feasible. However, the challenge of generating and utilizing modality-shared video-level language prompts to address modality gaps remains a critical problem. To address this problem, we propose a simple yet powerful framework, video-level language-driven VVI-ReID (VLD), which consists of two core modules: invariant-modality language prompting (IMLP) and spatial-temporal prompting (STP). IMLP employs a joint fine-tuning strategy for the visual encoder and the prompt learner to effectively generate modality-shared text prompts and align them with visual features from different modalities in CLIP's multimodal space, thereby mitigating modality differences. Additionally, STP models spatiotemporal information through two submodules, the spatial-temporal hub (STH) and spatial-temporal aggregation (STA), which further enhance IMLP by incorporating spatiotemporal information into text prompts. The STH aggregates and diffuses spatiotemporal information into the [CLS] token of each frame across the vision transformer (ViT) layers, whereas STA introduces dedicated identity-level loss and specialized multihead attention to ensure that the STH focuses on identity-relevant spatiotemporal feature aggregation. The VLD framework achieves state-of-the-art results on two VVI-ReID benchmarks. The code will be released at https://github.com/Visuang/VLD.
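The abstract describes the spatial-temporal hub (STH) as aggregating and diffusing spatiotemporal information into the [CLS] token of each frame across ViT layers. Below is a minimal, hypothetical PyTorch sketch of that aggregate-then-diffuse idea with a learnable hub token; module names, attention layout, and the residual update are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch (PyTorch) of the spatial-temporal hub (STH) idea: a learnable
# hub token aggregates information from the per-frame [CLS] tokens of a ViT and
# diffuses it back, so each frame's [CLS] token carries sequence-level context.
# Names and structure are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class SpatialTemporalHub(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.hub = nn.Parameter(torch.zeros(1, 1, dim))        # learnable hub token
        self.aggregate = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.diffuse = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, cls_tokens: torch.Tensor) -> torch.Tensor:
        # cls_tokens: (B, T, D) -- the [CLS] token of each of T frames in a tracklet.
        B = cls_tokens.size(0)
        hub = self.hub.expand(B, -1, -1)                        # (B, 1, D)
        # 1) Aggregate: the hub attends over all frame [CLS] tokens.
        hub, _ = self.aggregate(hub, cls_tokens, cls_tokens)
        # 2) Diffuse: each frame [CLS] token attends back to the hub,
        #    injecting sequence-level (spatiotemporal) context into every frame.
        updated, _ = self.diffuse(cls_tokens, hub, hub)
        return cls_tokens + updated                             # residual update


if __name__ == "__main__":
    # Usage: such a module would sit between ViT blocks so per-frame [CLS]
    # tokens exchange spatiotemporal information before the next layer.
    sth = SpatialTemporalHub(dim=512)
    frame_cls = torch.randn(4, 8, 512)   # 4 tracklets, 8 frames each
    print(sth(frame_cls).shape)          # torch.Size([4, 8, 512])
```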

Shuang Li, Jiaxu Leng, Changjiang Kuang, Mingpi Tan, Xinbo Gao

Subject: Computing Technology, Computer Technology

Shuang Li, Jiaxu Leng, Changjiang Kuang, Mingpi Tan, Xinbo Gao. Video-Level Language-Driven Video-Based Visible-Infrared Person Re-Identification [EB/OL]. (2025-06-03) [2025-06-24]. https://arxiv.org/abs/2506.02439.
