MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs
While multimodal large language models (MLLMs) have demonstrated extraordinary vision-language understanding capabilities, their abilities to solve instance-level visual-language problems beyond a single image warrant further exploration. To assess these unproven abilities of MLLMs, this paper proposes a new visual grounding task called multi-context visual grounding, which aims to localize instances of interest across multiple images based on open-ended text prompts. To facilitate this research, we construct a new dataset, MC-Bench, featuring 2K high-quality, manually annotated samples. Each sample consists of an instance-level labeled image pair and a corresponding text prompt that indicates the target instances in the images. These text prompts are highly open-ended, follow three distinct styles, and cover 20 practical skills. We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities, along with our simple yet effective agentic baseline and a baseline finetuned via multi-context instruction tuning. Our evaluation reveals a non-trivial performance gap between existing MLLMs and humans, along with some insightful observations that suggest potential future directions. We hope that MC-Bench and our empirical findings encourage the research community to further advance the untapped potential of MLLMs in instance-level tasks, particularly in multi-image contexts. Project page: https://xuyunqiu.github.io/MC-Bench.
Yunqiu Xu, Linchao Zhu, Yi Yang
Computing Technology, Computer Technology
Yunqiu Xu, Linchao Zhu, Yi Yang. MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [EB/OL]. (2025-07-22) [2025-08-15]. https://arxiv.org/abs/2410.12332.