
Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory

Source: arXiv
Abstract

The evaluation of large language models (LLMs) via benchmarks is widespread, yet inconsistencies between different leaderboards and poor separability among top models raise concerns about their ability to accurately reflect authentic model capabilities. This paper provides a critical analysis of benchmark effectiveness, examining prominent mainstream LLM benchmarks using results from diverse models. We first propose a new framework for accurate and reliable estimation of item characteristics and model abilities: the Pseudo-Siamese Network for Item Response Theory (PSN-IRT), an enhanced Item Response Theory framework that incorporates a rich set of item parameters within an IRT-grounded architecture. Based on PSN-IRT, we conduct an extensive analysis that reveals significant and varied shortcomings in the measurement quality of current benchmarks. Furthermore, we demonstrate that PSN-IRT can be leveraged to construct smaller benchmarks that maintain stronger alignment with human preference.
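For background, Item Response Theory models the probability that a test taker of latent ability θ answers a given item correctly as a function of per-item parameters. A minimal sketch using the standard three-parameter logistic (3PL) model (the abstract does not specify PSN-IRT's exact parameterization, so this is illustrative only):

$$P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}}$$

Here θ is the model's ability, a_i the item's discrimination, b_i its difficulty, and c_i its guessing floor; the "rich set of item parameters" that PSN-IRT estimates plausibly extends item characteristics of this kind to benchmark items.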

Hongli Zhou, Hui Huang, Ziqing Zhao, Lvyuan Han, Huicheng Wang, Kehai Chen, Muyun Yang, Wei Bao, Jian Dong, Bing Xu, Conghui Zhu, Hailong Cao, Tiejun Zhao

Subject: Computing Technology, Computer Technology

Hongli Zhou, Hui Huang, Ziqing Zhao, Lvyuan Han, Huicheng Wang, Kehai Chen, Muyun Yang, Wei Bao, Jian Dong, Bing Xu, Conghui Zhu, Hailong Cao, Tiejun Zhao. Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory [EB/OL]. (2025-05-20) [2025-06-01]. https://arxiv.org/abs/2505.15055.
