
Exploring Localization for Self-supervised Fine-grained Contrastive Learning

Source: arXiv
Abstract

Self-supervised contrastive learning has demonstrated great potential in learning visual representations. Despite its success in various downstream tasks such as image classification and object detection, self-supervised pre-training for fine-grained scenarios has not been fully explored. We point out that current contrastive methods are prone to memorizing background/foreground texture and are therefore limited in localizing the foreground object. Our analysis suggests that learning to extract discriminative texture information and learning to localize are equally crucial for fine-grained self-supervised pre-training. Based on these findings, we introduce cross-view saliency alignment (CVSA), a contrastive learning framework that first crops and swaps the saliency regions of images as a novel view-generation step and then guides the model to localize foreground objects via a cross-view alignment loss. Extensive experiments on both small- and large-scale fine-grained classification benchmarks show that CVSA significantly improves the learned representation.
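
The abstract names two components: a crop-and-swap view generation over saliency regions and a cross-view alignment loss that encourages foreground localization. Below is a minimal, illustrative sketch of how such components might look in PyTorch; it is not the authors' reference implementation, and all names (`swap_saliency_regions`, `cross_view_alignment_loss`) and design details (resizing swapped crops, average-pooling saliency regions of the feature maps, cosine alignment) are assumptions made for illustration only.

```python
# Hypothetical sketch of crop-and-swap view generation and a cross-view
# alignment loss, loosely following the ideas described in the abstract.
# NOT the paper's official implementation; all details are assumptions.
import torch
import torch.nn.functional as F


def swap_saliency_regions(img_a, img_b, box_a, box_b):
    """Crop the salient region of each image, resize it, and paste it into the
    other image at that image's salient location (crop-and-swap view generation).
    img_*: (C, H, W) tensors; box_*: (y0, x0, y1, x1) integer saliency boxes."""
    view_a, view_b = img_a.clone(), img_b.clone()
    crop_a = img_a[:, box_a[0]:box_a[2], box_a[1]:box_a[3]]
    crop_b = img_b[:, box_b[0]:box_b[2], box_b[1]:box_b[3]]
    ha, wa = box_a[2] - box_a[0], box_a[3] - box_a[1]
    hb, wb = box_b[2] - box_b[0], box_b[3] - box_b[1]
    # Resize each crop to fit the saliency box of the other image before pasting.
    view_a[:, box_a[0]:box_a[2], box_a[1]:box_a[3]] = F.interpolate(
        crop_b.unsqueeze(0), size=(ha, wa), mode="bilinear", align_corners=False)[0]
    view_b[:, box_b[0]:box_b[2], box_b[1]:box_b[3]] = F.interpolate(
        crop_a.unsqueeze(0), size=(hb, wb), mode="bilinear", align_corners=False)[0]
    return view_a, view_b


def cross_view_alignment_loss(feat_1, feat_2, box_1, box_2, stride=8):
    """Align pooled foreground features across two views by maximizing the
    cosine similarity of their saliency-region embeddings.
    feat_*: (C, H', W') feature maps at a given stride w.r.t. the input image."""
    def pool_region(feat, box):
        y0, x0, y1, x1 = [v // stride for v in box]
        region = feat[:, y0:max(y1, y0 + 1), x0:max(x1, x0 + 1)]
        return region.mean(dim=(1, 2))
    z1 = F.normalize(pool_region(feat_1, box_1), dim=0)
    z2 = F.normalize(pool_region(feat_2, box_2), dim=0)
    return 1.0 - (z1 * z2).sum()


if __name__ == "__main__":
    # Toy usage with random images and hand-picked saliency boxes.
    img_a, img_b = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
    box_a, box_b = (40, 50, 160, 180), (30, 20, 150, 170)
    view_a, view_b = swap_saliency_regions(img_a, img_b, box_a, box_b)
    feat_a, feat_b = torch.rand(128, 28, 28), torch.rand(128, 28, 28)
    loss = cross_view_alignment_loss(feat_a, feat_b, box_a, box_b)
    print(view_a.shape, view_b.shape, loss.item())
```

In practice the saliency boxes would come from a saliency detector rather than being hand-picked, and this alignment term would be combined with a standard contrastive objective; those details are beyond this sketch.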

Zelin Zang, Siyuan Li, Stan Z. Li, Di Wu

Computing technology; computer technology

Zelin Zang, Siyuan Li, Stan Z. Li, Di Wu. Exploring Localization for Self-supervised Fine-grained Contrastive Learning [EB/OL]. (2021-06-29) [2025-06-04]. https://arxiv.org/abs/2106.15788.
