Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation
Recently, test-time adaptation (TTA) has attracted wide interest in the context of vision-language models (VLMs) for image classification. However, to the best of our knowledge, the problem has been completely overlooked in dense prediction tasks such as Open-Vocabulary Semantic Segmentation (OVSS). In response, we propose a novel TTA method tailored to adapting VLMs for segmentation at test time. Unlike TTA methods for image classification, our Multi-Level and Multi-Prompt (MLMP) entropy minimization integrates features from intermediate vision-encoder layers and is performed with different text-prompt templates at both the global CLS-token level and the local pixel-wise level. Our approach can be used as a plug-and-play module for any segmentation network, does not require additional training data or labels, and remains effective even with a single test sample. Furthermore, we introduce a comprehensive OVSS TTA benchmark suite, which integrates a rigorous evaluation protocol, seven segmentation datasets, and 15 common corruptions, for a total of 82 distinct test scenarios, establishing a standardized testbed for future TTA research in open-vocabulary segmentation. Our experiments on this suite demonstrate that our segmentation-tailored method consistently delivers significant gains over direct adoption of TTA classification baselines.
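As a rough illustration of the idea described in the abstract, the following minimal sketch computes an entropy objective averaged over several encoder levels, several prompt templates, and both the global (CLS) and local (pixel-wise) predictions. It is not the authors' implementation: the function names (`mean_entropy`, `mlmp_style_objective`), the data layout, the dot-product logits, and the equal weighting of all terms are assumptions made purely for illustration, and the abstract does not specify how the terms are actually combined.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_entropy(tokens, class_embeds, scale=1.0):
    """Average prediction entropy over a list of visual tokens.

    Each token is scored against every class text embedding with a
    scaled dot product (an assumed similarity, for illustration only).
    """
    total = 0.0
    for tok in tokens:
        logits = [scale * dot(tok, c) for c in class_embeds]
        total += entropy(softmax(logits))
    return total / len(tokens)


def mlmp_style_objective(levels, prompt_sets):
    """Hypothetical multi-level, multi-prompt entropy objective.

    levels:      one dict per encoder layer, {"cls": vector, "pixels": [vectors]}
    prompt_sets: one list of class text embeddings per prompt template
    All terms are weighted equally here, which is an assumption.
    """
    terms = []
    for level in levels:
        for class_embeds in prompt_sets:
            terms.append(mean_entropy([level["cls"]], class_embeds))   # global term
            terms.append(mean_entropy(level["pixels"], class_embeds))  # local term
    return sum(terms) / len(terms)


# Toy data: two classes, two prompt templates, two encoder levels.
prompts = [
    [[1.0, 0.0], [0.0, 1.0]],  # embeddings under a first (hypothetical) template
    [[0.9, 0.1], [0.1, 0.9]],  # embeddings under a second (hypothetical) template
]
uninformative = [{"cls": [0.0, 0.0], "pixels": [[0.0, 0.0], [0.0, 0.0]]}] * 2
confident = [{"cls": [5.0, 0.0], "pixels": [[5.0, 0.0], [0.0, 5.0]]}] * 2

loss_uninformative = mlmp_style_objective(uninformative, prompts)
loss_confident = mlmp_style_objective(confident, prompts)
```

In an actual TTA loop, a quantity of this kind would be minimized with respect to a small set of model parameters on each test batch; which parameters are updated is not stated in the abstract and is left out of this sketch.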
Mehrdad Noori, David Osowiechi, Gustavo Adolfo Vargas Hakim, Ali Bahri, Moslem Yazdanpanah, Sahar Dastani, Farzad Beizaee, Ismail Ben Ayed, Christian Desrosiers
Computing technology, computer technology
Mehrdad Noori, David Osowiechi, Gustavo Adolfo Vargas Hakim, Ali Bahri, Moslem Yazdanpanah, Sahar Dastani, Farzad Beizaee, Ismail Ben Ayed, Christian Desrosiers. Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation [EB/OL]. (2025-05-27) [2025-06-21]. https://arxiv.org/abs/2505.21844.