OneEval: Benchmarking LLM Knowledge-intensive Reasoning over Diverse Knowledge Bases
Large Language Models (LLMs) have demonstrated substantial progress on reasoning tasks over unstructured text, yet their capabilities deteriorate significantly when reasoning requires integrating structured external knowledge such as knowledge graphs, code, or formal logic. This limitation stems in part from the absence of benchmarks that systematically evaluate LLM performance across diverse structured knowledge modalities. To address this gap, we introduce OneEval, a comprehensive benchmark explicitly designed to assess the knowledge-intensive reasoning capabilities of LLMs across four knowledge modalities (unstructured text, knowledge graphs, code, and formal logic) and five critical domains (general knowledge, government, science, law, and programming). OneEval comprises 4,019 carefully curated instances and includes a challenging subset, OneEval-Hard, consisting of 1,285 particularly difficult cases. Through extensive evaluation of 18 state-of-the-art open-source and proprietary LLMs, we establish three core findings: (a) persistent limitations in structured reasoning, with even the strongest model achieving only 32.2% accuracy on OneEval-Hard; (b) performance that declines consistently as the structural complexity of the knowledge base increases, with accuracy dropping sharply from 53% (textual reasoning) to 25% (formal logic); and (c) diminishing returns from extended reasoning chains, highlighting the need for models to adapt reasoning depth to task complexity. We publicly release the OneEval datasets, evaluation scripts, and baseline results, accompanied by a leaderboard to facilitate ongoing progress in structured knowledge reasoning.
Yongrui Chen, Zhiqiang Liu, Jing Yu, Lin Ren, Nan Hu, Xinbang Dai, Jiajun Liu, Jiazhen Kang, Shenyu Zhang, Xinda Wang, Keyan Ding, Pengfei Shen, Haolei Zhu, Hongjie Deng, Yisong Wang, Tongtong Wu, Sheng Bi, Wen Zhang, Tianxing Wu, Qiu Ji, Haofen Wang, Wenliang Chen, Huajun Chen, Guilin Qi
Subjects: information dissemination, computing technology for knowledge dissemination, computer technology
Yongrui Chen, Zhiqiang Liu, Jing Yu, Lin Ren, Nan Hu, Xinbang Dai, Jiajun Liu, Jiazhen Kang, Shenyu Zhang, Xinda Wang, Keyan Ding, Pengfei Shen, Haolei Zhu, Hongjie Deng, Yisong Wang, Tongtong Wu, Sheng Bi, Wen Zhang, Tianxing Wu, Qiu Ji, Haofen Wang, Wenliang Chen, Huajun Chen, Guilin Qi. OneEval: Benchmarking LLM Knowledge-intensive Reasoning over Diverse Knowledge Bases [EB/OL]. (2025-06-14) [2025-07-01]. https://arxiv.org/abs/2506.12577.