
Rethinking Vision Transformer for Large-Scale Fine-Grained Image Retrieval

Source: arXiv

Abstract

Large-scale fine-grained image retrieval (FGIR) aims to retrieve images belonging to the same subcategory as a given query by capturing subtle differences in a large-scale setting. Recently, Vision Transformers (ViTs) have been employed in FGIR due to their powerful self-attention mechanism for modeling long-range dependencies. However, most Transformer-based methods focus primarily on leveraging self-attention to distinguish fine-grained details, while overlooking the high computational complexity and redundant dependencies inherent to these models, limiting their scalability and effectiveness in large-scale FGIR. In this paper, we propose an Efficient and Effective ViT-based framework, termed EET, which integrates a token pruning module with a discriminative transfer strategy to address these limitations. Specifically, we introduce a content-based token pruning scheme to enhance the efficiency of the vanilla ViT, progressively removing background or low-discriminative tokens at different stages by exploiting feature responses and the self-attention mechanism. To ensure the resulting efficient ViT retains strong discriminative power, we further present a discriminative transfer strategy comprising both discriminative knowledge transfer and discriminative region guidance. Using a distillation paradigm, these components transfer knowledge from a larger "teacher" ViT to a more efficient "student" model, guiding the latter to focus on subtle yet crucial regions in a cost-free manner. Extensive experiments on two widely used fine-grained datasets and four large-scale fine-grained datasets demonstrate the effectiveness of our method. Specifically, EET reduces the inference latency of ViT-Small by 42.7% and boosts the retrieval performance of 16-bit hash codes by 5.15% on the challenging NABirds dataset.
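As a rough illustration of the content-based token pruning idea described in the abstract, the sketch below keeps only the patch tokens that receive the highest CLS-token attention before passing them to the next ViT stage. It is a minimal PyTorch sketch under stated assumptions; the names prune_tokens and keep_ratio are illustrative, not the authors' implementation.

import torch


def prune_tokens(tokens, cls_attn, keep_ratio=0.7):
    # tokens:   (B, 1 + N, D) -- CLS token followed by N patch tokens
    # cls_attn: (B, N)        -- attention from the CLS token to each patch, averaged over heads
    num_patches = tokens.shape[1] - 1
    num_keep = max(1, int(keep_ratio * num_patches))

    # Indices of the most discriminative (highest-attention) patches.
    keep_idx = cls_attn.topk(num_keep, dim=1).indices            # (B, num_keep)

    cls_tok = tokens[:, :1]                                      # (B, 1, D)
    patches = tokens[:, 1:]                                      # (B, N, D)
    idx = keep_idx.unsqueeze(-1).expand(-1, -1, patches.size(-1))
    kept = patches.gather(1, idx)                                # (B, num_keep, D)
    return torch.cat([cls_tok, kept], dim=1)                     # (B, 1 + num_keep, D)


if __name__ == "__main__":
    x = torch.randn(2, 1 + 196, 384)            # e.g. ViT-Small with 14x14 patches
    attn = torch.rand(2, 196).softmax(dim=-1)   # stand-in for averaged CLS attention
    print(prune_tokens(x, attn, keep_ratio=0.5).shape)  # torch.Size([2, 99, 384])

In the paper's framework, the pruned "student" ViT is additionally trained with the discriminative transfer strategy (knowledge distillation and region guidance from a larger "teacher" ViT), which is not shown in this sketch.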

Xin Jiang, Hao Tang, Yonghua Pan, Zechao Li

Subject: computing technology; computer technology

Xin Jiang, Hao Tang, Yonghua Pan, Zechao Li. Rethinking Vision Transformer for Large-Scale Fine-Grained Image Retrieval [EB/OL]. (2025-04-23) [2025-05-10]. https://arxiv.org/abs/2504.16691.
