WakenLLM: Evaluating Reasoning Potential and Stability in LLMs via Fine-Grained Benchmarking

Source: arXiv
Abstract

Large Language Models (LLMs) frequently output the label Unknown in reasoning tasks. Two scenarios can produce this output: (i) the input sample is genuinely unverifiable, but the model cannot understand why; and (ii) the problem is verifiable, but the model fails to solve it and therefore outputs Unknown. We refer to these cases collectively as the Vague Perception phenomenon. Current evaluations focus on whether such answers are honest, rather than analyzing the limits of LLM reasoning. To address this, we introduce WakenLLM, a framework that quantifies the portion of Unknown outputs attributable to model incapacity and evaluates whether stimulation can convert them into either correct answers (for verifiable samples) or justified responses with valid reasoning (for unverifiable samples). Our method offers a clearer picture of the limits of LLM reasoning and the potential for correction across various datasets. Comprehensive experiments on six LLMs suggest that, without any training or parameter revision, LLMs can achieve up to a 68.53% accuracy improvement on Vague Perception samples through guided understanding. Our work reveals that current baseline methods activate only a small portion of LLMs' reasoning potential, indicating considerable unexplored capacity. This extends the theoretical upper bounds of reasoning accuracy in LLMs. Consequently, this study deepens our understanding of the latent reasoning capacity of LLMs and offers a new perspective on addressing the Vague Perception phenomenon.

Zipeng Ling, Yuehao Tang, Shuliang Liu, Junqi Yang, Shenghong Fu, Chen Huang, Kejia Huang, Yao Wan, Zhichao Hou, Xuming Hu

Subject: Computing technology, computer technology

Zipeng Ling, Yuehao Tang, Shuliang Liu, Junqi Yang, Shenghong Fu, Chen Huang, Kejia Huang, Yao Wan, Zhichao Hou, Xuming Hu. WakenLLM: Evaluating Reasoning Potential and Stability in LLMs via Fine-Grained Benchmarking [EB/OL]. (2025-07-29) [2025-08-10]. https://arxiv.org/abs/2507.16199.
