
Affordance Benchmark for MLLMs

Source: arXiv
Abstract

Affordance theory posits that environments inherently offer action possibilities that shape perception and behavior. While Multimodal Large Language Models (MLLMs) excel in vision-language tasks, their ability to perceive affordance, which is crucial for intuitive and safe interactions, remains underexplored. To address this, we introduce A4Bench, a novel benchmark designed to evaluate the affordance perception abilities of MLLMs across two dimensions: 1) Constitutive Affordance, assessing understanding of inherent object properties through 1,282 question-answer pairs spanning nine sub-disciplines, and 2) Transformative Affordance, probing dynamic and contextual nuances (e.g., misleading, time-dependent, cultural, or individual-specific affordance) with 718 challenging question-answer pairs. Evaluating 17 MLLMs (nine proprietary and eight open-source) against human performance, we find that proprietary models generally outperform open-source counterparts, but all exhibit limited capabilities, particularly in transformative affordance perception. Furthermore, even top-performing models, such as Gemini-2.0-Pro (18.05% overall exact match accuracy), significantly lag behind human performance (best: 85.34%, worst: 81.25%). These findings highlight critical gaps in MLLMs' environmental understanding and provide a foundation for advancing AI systems toward more robust, context-aware interactions. The dataset is available at https://github.com/JunyingWang959/A4Bench/.
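The headline numbers above are overall exact match accuracy over the benchmark's question-answer pairs. Below is a minimal sketch of how such a score is computed, assuming simple whitespace- and case-normalized string comparison; the function name and toy data are illustrative and are not taken from the A4Bench code.

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answer
    after trivial whitespace/case normalization (illustrative sketch)."""
    assert len(predictions) == len(references)
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)

if __name__ == "__main__":
    # Toy example (hypothetical answers): 1 of 3 matches exactly -> 33.33%.
    preds = ["a chair affords sitting", "cutting", "opening"]
    refs = ["a chair affords sitting", "pouring", "closing"]
    print(f"{exact_match_accuracy(preds, refs):.2%}")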

Junying Wang, Wenzhe Li, Yalun Wu, Yingji Liang, Yijin Guo, Chunyi Li, Haodong Duan, Zicheng Zhang, Guangtao Zhai

Computing Technology, Computer Technology

Junying Wang, Wenzhe Li, Yalun Wu, Yingji Liang, Yijin Guo, Chunyi Li, Haodong Duan, Zicheng Zhang, Guangtao Zhai. Affordance Benchmark for MLLMs [EB/OL]. (2025-06-01) [2025-07-16]. https://arxiv.org/abs/2506.00893.
