Evaluating LLMs with Multiple Problems at once

Source: arXiv
Abstract

This paper shows the benefits of evaluating LLMs with multiple problems at once, a paradigm we call multi-problem evaluation (MPE). Unlike conventional single-problem evaluation, where a prompt presents a single problem and expects one specific answer, MPE places multiple problems together in a single prompt and assesses how well an LLM answers all of them in a single output. Leveraging 6 existing classification benchmarks and 12 existing reasoning benchmarks, we introduce a new benchmark called ZeMPE (Zero-shot Multi-Problem Evaluation), comprising 53,100 zero-shot multi-problem prompts. We experiment with a total of 13 LLMs from 5 model families on ZeMPE to present a comprehensive and systematic MPE. Our results show that LLMs can handle multiple problems from a single data source as well as they handle them separately, but there are conditions under which this multi-problem handling capability falls short. In addition, we perform further in-depth analyses and explore model-level factors that may enable multi-problem handling capabilities in LLMs. We release our corpus and code to facilitate future research.
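To make the contrast with single-problem evaluation concrete, the following is a minimal sketch of how a multi-problem prompt might be assembled and scored. It is not the ZeMPE implementation: the prompt template, the numbered answer format, and the helper names `build_multi_problem_prompt`, `parse_answers`, and `per_problem_accuracy` are all assumptions made for illustration.

```python
import re


def build_multi_problem_prompt(task_instruction, problems):
    """Join several problems into one numbered zero-shot prompt.

    The template is hypothetical, not the format used by ZeMPE.
    """
    numbered = "\n".join(
        f"{i}. {text}" for i, text in enumerate(problems, start=1)
    )
    return (
        f"{task_instruction}\n\n{numbered}\n\n"
        "Answer every problem, one per line, in the form '<number>. <answer>'."
    )


def parse_answers(output, n_problems):
    """Map each problem number to its answer; unanswered problems stay None."""
    answers = {i: None for i in range(1, n_problems + 1)}
    pattern = r"^[ \t]*(\d+)\.[ \t]*(.+?)[ \t]*$"
    for match in re.finditer(pattern, output, flags=re.MULTILINE):
        idx = int(match.group(1))
        # Keep only the first answer given for each known problem number.
        if idx in answers and answers[idx] is None:
            answers[idx] = match.group(2)
    return answers


def per_problem_accuracy(answers, gold):
    """Fraction of problems answered correctly (case-insensitive exact match)."""
    correct = sum(
        1
        for i, expected in enumerate(gold, start=1)
        if answers.get(i) is not None
        and answers[i].lower() == expected.lower()
    )
    return correct / len(gold)
```

In this sketch a skipped problem counts as wrong, so the score reflects whether the model answered every problem in its single output, not only the ones it chose to address.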

Zhengxiang Wang, Jordan Kodner, Owen Rambow

Computing technology, computer technology

Zhengxiang Wang, Jordan Kodner, Owen Rambow. Evaluating LLMs with Multiple Problems at once [EB/OL]. (2025-06-21) [2025-07-17]. https://arxiv.org/abs/2406.10786.