|国家预印本平台
首页|MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Image Segmentation

MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Image Segmentation

MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Image Segmentation

来源:Arxiv_logoArxiv
英文摘要

Pretrained vision-language models (VLMs), \eg CLIP, are increasingly used to bridge the gap between open- and close-vocabulary recognition in open-vocabulary image segmentation. As VLMs are generally pretrained with low-resolution images (e.g. $224\times224$), most previous methods operate only on downscaled images. We question this design as low resolution features often fail to preserve fine details. A typical solution is to employ additional image backbones for high-resolution inputs, but it also introduce significant computation overhead. Therefore, we propose MROVSeg, a multi-resolution training framework for open-vocabulary image segmentation with a single pretrained CLIP backbone, that uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder. Its key components include a Multi-Res Adapter, which restores the spatial geometry and grasps local-global correspondences across patches by interacting with multi-resolution features. To achieve accurate segmentation, we introduce Multi-grained Masked Attention scheme to aggregate multi-grained semantics from multi-resolution CLIP features to object queries. Through comprehensive experiments, we demonstrate the superiority of MROVSeg on well-established open-vocabulary image segmentation benchmarks, establishing new standards for open-vocabulary image segmentation.

Jinqiao Wang、Ming Tang、Yunfang Niu、Yingying Chen、Yuanbing Zhu、Bingke Zhu

计算技术、计算机技术

Jinqiao Wang,Ming Tang,Yunfang Niu,Yingying Chen,Yuanbing Zhu,Bingke Zhu.MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Image Segmentation[EB/OL].(2024-08-27)[2025-05-24].https://arxiv.org/abs/2408.14776.点此复制

评论