National Preprint Platform

VideoCogQA: A Controllable Benchmark for Evaluating Cognitive Abilities in Video-Language Models


Source: arXiv
Abstract

Recent advancements in Large Video-Language Models (LVLMs) have led to promising results in multimodal video understanding. However, it remains unclear whether these models possess the cognitive capabilities required for high-level tasks, particularly those involving symbolic and abstract perception. Existing benchmarks typically rely on real-world, annotated videos, which lack control over video content and inherent difficulty, limiting their diagnostic power. To bridge this gap, we propose VideoCogQA, a scalable and fully controllable benchmark inspired by game-world environments, designed to evaluate the cognitive abilities of LVLMs. By generating synthetic videos via a programmatic engine, VideoCogQA allows fine-grained control over visual elements, temporal dynamics, and task difficulty. This approach enables a focused evaluation of video cognitive abilities, independent of prior knowledge from visual scene semantics. The dataset includes 800 videos and 3,280 question-answer pairs, featuring tasks related to abstract concepts, symbolic elements, and multimodal integration, with varying levels of difficulty. Experimental results show that even state-of-the-art (SOTA) models, such as GPT-4o, achieve an average performance of 48.8% on tasks involving abstract concepts. Additionally, performance drops by 15% as task complexity increases, highlighting the challenges LVLMs face in maintaining consistent performance. Through this work, we hope to show the limitations of current LVLMs and offer insights into how they can more effectively emulate human cognitive processes in the future.

Zhi Li, Chenglin Li, Yin Zhang, Feng Tao, Qianglong Chen

Subject: Computing Technology; Computer Technology

Zhi Li, Chenglin Li, Yin Zhang, Feng Tao, Qianglong Chen. VideoCogQA: A Controllable Benchmark for Evaluating Cognitive Abilities in Video-Language Models [EB/OL]. (2025-07-01) [2025-07-21]. https://arxiv.org/abs/2411.09105
