A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks
We investigate 17 benchmarks (e.g. SugarCREPE, VALSE) commonly used for measuring the compositional understanding capabilities of vision-language models (VLMs). We scrutinize design choices in their construction, including data source (e.g. MS-COCO) and curation procedures (e.g. constructing negative images/captions), uncovering several inherent biases across most benchmarks. We find that blind heuristics (e.g. token-length, log-likelihood under a language model) perform on par with CLIP models, indicating that these benchmarks do not effectively measure compositional understanding. We demonstrate that the underlying factor is a distribution asymmetry between positive and negative images/captions, induced by the benchmark construction procedures. To mitigate these issues, we provide a few key recommendations for constructing more robust vision-language compositional understanding benchmarks that are less prone to such simple attacks.
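The "blind heuristic" attack described above can be illustrated with a minimal sketch: a scorer that picks between a positive and a hard-negative caption using only token length, never looking at the image. All function names, the threshold, and the toy captions are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of a "blind" token-length heuristic for an
# image-to-text matching benchmark: choose the caption whose length is
# closest to a typical positive-caption length, ignoring the image.
# Names, data, and the typical-length value are illustrative only.

def token_length(caption: str) -> int:
    """Whitespace tokenization as a crude stand-in for a real tokenizer."""
    return len(caption.split())

def blind_pick(captions: list[str], typical_positive_len: float) -> int:
    """Return the index of the caption whose token count is closest to
    the typical positive-caption length -- no image features are used."""
    return min(
        range(len(captions)),
        key=lambda i: abs(token_length(captions[i]) - typical_positive_len),
    )

# Toy example: negatives generated by word swaps/insertions can differ
# systematically in length from positives, making them separable blindly.
positive = "a dog chasing a red ball in the park"
negative = "a red dog chasing a ball happily in the green park"
idx = blind_pick([positive, negative], typical_positive_len=8.0)
```

If such a purely text-side statistic matches CLIP's accuracy, the benchmark is measuring a distributional artifact of its negative-construction procedure rather than compositional understanding.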
Vishaal Udandarao, Mehdi Cherti, Shyamgopal Karthik, Jenia Jitsev, Samuel Albanie, Matthias Bethge
Computing technology, computer technology
Vishaal Udandarao, Mehdi Cherti, Shyamgopal Karthik, Jenia Jitsev, Samuel Albanie, Matthias Bethge. A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks [EB/OL]. (2025-06-09) [2025-07-16]. https://arxiv.org/abs/2506.08227.