National Preprint Platform

Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods

Source: arXiv

English Abstract

Prior human-object interaction (HOI) detection methods have integrated early vision-language models (VLMs) such as CLIP, but only as supporting components within their frameworks. In contrast, recent advances in large, generative VLMs suggest that these models may already possess a strong ability to understand images involving HOI. This naturally raises an important question: can general-purpose standalone VLMs effectively solve HOI detection, and how do they compare with specialized HOI methods? Answering this requires a benchmark that can accommodate both paradigms. However, existing HOI benchmarks such as HICO-DET were developed before the emergence of modern VLMs, and their evaluation protocols require exact matches to annotated HOI classes. This is poorly aligned with the generative nature of VLMs, which often yield multiple valid interpretations in ambiguous cases. For example, a static image may capture a person mid-motion with a frisbee, which can plausibly be interpreted as either "throwing" or "catching". When only "catching" is annotated, the other interpretation, though equally plausible for the image, is marked incorrect under exact matching. As a result, correct predictions may be penalized, affecting both VLMs and HOI-specific methods. To avoid penalizing valid predictions, we introduce a new benchmark that reformulates HOI detection as a multiple-answer multiple-choice task, where each question includes only ground-truth positive options and a curated set of negatives constructed to reduce ambiguity (e.g., when "catching" is annotated, "throwing" is not selected as a negative). The proposed evaluation protocol is the first of its kind for both VLMs and HOI methods, enabling direct comparison and offering new insight into the current state of progress in HOI understanding.
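The curated-negatives idea in the abstract can be sketched in a few lines. The following is a hypothetical illustration, not the paper's actual implementation: the ambiguity pairs, function names, and exact-set scoring rule are all assumptions made for clarity.

```python
# Hypothetical sketch of a multiple-answer multiple-choice HOI protocol.
# Verb pairs that are mutually plausible for a static image: if one is
# annotated as a positive, the other is never used as a distractor.
AMBIGUOUS_PAIRS = {("throwing", "catching"), ("catching", "throwing")}

def build_options(positives, candidate_negatives, num_negatives=3):
    """Keep all ground-truth positives; curate negatives so that no verb
    ambiguous with an annotated positive appears as a distractor."""
    negatives = [
        v for v in candidate_negatives
        if v not in positives
        and not any((p, v) in AMBIGUOUS_PAIRS for p in positives)
    ]
    return positives + negatives[:num_negatives]

def score(selected, positives):
    """Illustrative exact-set scoring: credit the question only if the
    model selects exactly the set of annotated positives."""
    return set(selected) == set(positives)

if __name__ == "__main__":
    # "catching" is annotated, so "throwing" is excluded from the distractors.
    opts = build_options(["catching"],
                         ["throwing", "riding", "repairing", "kicking"])
    print(opts)
    print(score(["catching"], ["catching"]))
```

Because ambiguous counterparts never appear among the options, a model that prefers "throwing" over "catching" simply has no wrong-but-plausible choice to be penalized for, which is the point of the reformulation.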

Bo Wang, Qinqian Lei, Robby T. Tan

Subject: Computing Technology; Computer Technology

Bo Wang, Qinqian Lei, Robby T. Tan. Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods [EB/OL]. (2025-08-26) [2025-09-05]. https://arxiv.org/abs/2508.18753.
