SA-Person: Text-Based Person Retrieval with Scene-aware Re-ranking

Abstract

Text-based person retrieval aims to identify a target individual from a gallery of images based on a natural language description. It presents a significant challenge due to the complexity of real-world scenes and the ambiguity of appearance-related descriptions. Existing methods primarily emphasize appearance-based cross-modal retrieval, often neglecting the contextual information embedded within the scene, which can offer valuable complementary insights for retrieval. To address this, we introduce SCENEPERSON-13W, a large-scale dataset featuring over 100,000 scenes with rich annotations covering both pedestrian appearance and environmental cues. Based on this, we propose SA-Person, a two-stage retrieval framework. In the first stage, it performs discriminative appearance grounding by aligning textual cues with pedestrian-specific regions. In the second stage, it introduces SceneRanker, a training-free, scene-aware re-ranking method leveraging multimodal large language models to jointly reason over pedestrian appearance and the global scene context. Experiments on SCENEPERSON-13W validate the effectiveness of our framework in challenging scene-level retrieval scenarios. The code and dataset will be made publicly available.
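The two-stage structure described in the abstract (coarse appearance retrieval, then re-ranking of the top candidates with scene context) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-similarity first stage, the function `retrieve_then_rerank`, the `scene_score` callable, and the toy vectors and scores are all assumptions made for illustration. In SA-Person the second stage is performed by SceneRanker using a multimodal large language model; here a plain lookup table stands in for it.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query_vec: Vector,
    gallery: Dict[str, Vector],
    scene_score: Callable[[str], float],
    top_k: int = 2,
) -> List[str]:
    """Generic retrieve-then-rerank (hypothetical helper, not the paper's code).

    Stage 1 ranks every gallery image by appearance similarity to the query.
    Stage 2 re-orders only the top_k candidates by a scene-aware score and
    leaves the remaining images in their stage-1 order.
    """
    ranked = sorted(
        gallery,
        key=lambda image_id: cosine(query_vec, gallery[image_id]),
        reverse=True,
    )
    head, tail = ranked[:top_k], ranked[top_k:]
    head = sorted(head, key=scene_score, reverse=True)
    return head + tail


# Toy data: appearance embeddings and made-up scene scores (assumptions).
gallery = {"img_a": [1.0, 0.0], "img_b": [0.9, 0.1], "img_c": [0.0, 1.0]}
query = [1.0, 0.0]
scene_scores = {"img_a": 0.2, "img_b": 0.8, "img_c": 0.9}

result = retrieve_then_rerank(query, gallery, lambda i: scene_scores[i], top_k=2)
print(result)  # ['img_b', 'img_a', 'img_c']
```

In this toy run `img_c` has the highest scene score but stays last, because only the top two appearance matches are passed to the second stage. Restricting the expensive re-ranker to a short candidate list is the usual reason for a two-stage design.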

Authors: Daming Gao, Yang Yang, Zhen Lei, Min Cao, Jinlin Wu, Zhen Chen, Yingjia Xu

Subject classification: Computing technology, computer technology

Recommended citation: Daming Gao, Yang Yang, Zhen Lei, Min Cao, Jinlin Wu, Zhen Chen, Yingjia Xu. SA-Person: Text-Based Person Retrieval with Scene-aware Re-ranking [EB/OL]. (2025-06-26) [2025-06-29]. https://arxiv.org/abs/2505.24466.
