Enhanced Vision-Language Models for Diverse Sensor Understanding: Cost-Efficient Optimization and Benchmarking
Large-scale Vision-Language Models (VLMs) have achieved notable progress in aligning visual inputs with text. However, their ability to deeply understand the unique physical properties of non-RGB vision sensor images remains limited. In this paper, we revisit and analyze these limitations and introduce a novel, cost-efficient paradigm that significantly advances sensor image understanding without requiring extensive training data or any modifications to existing VLM architectures. Specifically, we propose Sensor-Aware Attributes Fine-Tuning (SAFT) with Diverse Negative Attributes (DNA) optimization, which leverages minimal sensor-specific data to enable robust learning of non-RGB characteristics and to overcome the RGB-centric biases inherent in current VLMs. In addition, we present VS-TDX, the first comprehensive public benchmark designed to rigorously evaluate VLMs' sensor-specific understanding across diverse and realistic scenarios. Through extensive experiments on multiple VLMs and sensor modalities, we validate that our method consistently delivers superior performance and generalization under resource-constrained, architecture-invariant settings. Our approach provides a practical advance toward the scalable deployment of VLMs in increasingly sensor-diverse real-world environments.
Youngchae Chee, Sangyun Chung, Youngjoon Yu, Se Yeon Kim, Yong Man Ro
Computing Technology, Computer Technology
Youngchae Chee, Sangyun Chung, Youngjoon Yu, Se Yeon Kim, Yong Man Ro. Enhanced Vision-Language Models for Diverse Sensor Understanding: Cost-Efficient Optimization and Benchmarking [EB/OL]. (2025-08-01) [2025-08-18]. https://arxiv.org/abs/2412.20750.