
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks


Source: arXiv
Abstract

We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks, to address the highly heterogeneous daily use cases of end users. Our objective is to optimize for a set of high-quality data samples that cover a highly diverse and rich set of multimodal tasks, while enabling cost-effective and accurate model evaluation. In particular, we collected 505 realistic tasks encompassing over 8,000 samples from 16 expert annotators to extensively cover the multimodal task space. Instead of unifying these problems into standard multi-choice questions (like MMMU, MMBench, and MMT-Bench), we embrace a wide range of output formats like numbers, phrases, code, LaTeX, coordinates, JSON, free-form, etc. To accommodate these formats, we developed over 40 metrics to evaluate these tasks. Unlike existing benchmarks, MEGA-Bench offers a fine-grained capability report across multiple dimensions (e.g., application, input type, output format, skill), allowing users to interact with and visualize model capabilities in depth. We evaluate a wide variety of frontier vision-language models on MEGA-Bench to understand their capabilities across these dimensions.
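The abstract describes dispatching to format-specific metrics (numbers, JSON, phrases, etc.) rather than forcing every task into multiple choice. The benchmark's actual metric names and logic are not given in the abstract, so the following is only an illustrative sketch of what per-format scoring could look like; all function names and the dispatch table are hypothetical.

```python
import json
import re

# Hypothetical sketch of format-aware scoring in the spirit of MEGA-Bench's
# per-format metrics. Everything below is illustrative, not the benchmark's
# actual implementation.

def score_number(pred: str, ref: str, tol: float = 1e-6) -> float:
    """Numeric answers: compare within a tolerance."""
    try:
        return float(abs(float(pred) - float(ref)) <= tol)
    except ValueError:
        return 0.0

def score_json(pred: str, ref: str) -> float:
    """JSON answers: parse both sides and compare structurally (key order ignored)."""
    try:
        return float(json.loads(pred) == json.loads(ref))
    except json.JSONDecodeError:
        return 0.0

def score_phrase(pred: str, ref: str) -> float:
    """Short phrases: normalized exact match (case and whitespace folded)."""
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    return float(norm(pred) == norm(ref))

# Each task declares its output format; the evaluator dispatches accordingly.
METRICS = {"number": score_number, "json": score_json, "phrase": score_phrase}

def evaluate(output_format: str, pred: str, ref: str) -> float:
    return METRICS[output_format](pred, ref)
```

The design point is that a single string-equality metric would unfairly penalize, e.g., `{"a": 1, "b": 2}` against `{"b": 2, "a": 1}`, whereas a structural comparison scores them as equivalent.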

Xiang Yue, Jiacheng Chen, Yuan Liu, Hexiang Hu, Tianhao Liang, Wenhu Chen, Sherman Siu, Zhengqing Wang, Kai Wang, Yubo Wang, Yuansheng Ni, Wang Zhu, Ziyan Jiang, Bohan Lyu, Dongfu Jiang, Xuan He

Subjects: Information Science and Information Technology; Computing and Computer Technology

Xiang Yue, Jiacheng Chen, Yuan Liu, Hexiang Hu, Tianhao Liang, Wenhu Chen, Sherman Siu, Zhengqing Wang, Kai Wang, Yubo Wang, Yuansheng Ni, Wang Zhu, Ziyan Jiang, Bohan Lyu, Dongfu Jiang, Xuan He. MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks [EB/OL]. (2025-07-13) [2025-07-25]. https://arxiv.org/abs/2410.10563.
