CogMath: Assessing LLMs' Authentic Mathematical Ability from a Human Cognitive Perspective
Although large language models (LLMs) show promise in solving complex mathematical tasks, existing evaluation paradigms rely solely on a coarse measure of overall answer accuracy, which is insufficient for assessing their authentic capabilities. In this paper, we propose \textbf{CogMath}, which comprehensively assesses LLMs' mathematical abilities through the lens of human cognition. Specifically, inspired by psychological theories, CogMath formalizes the human reasoning process into three stages: \emph{problem comprehension}, \emph{problem solving}, and \emph{solution summarization}. Within these stages, we investigate perspectives such as numerical calculation, knowledge, and counterfactuals, and design a total of 9 fine-grained evaluation dimensions. For each dimension, we develop an ``\emph{Inquiry}-\emph{Judge}-\emph{Reference}'' multi-agent system to generate inquiries that assess LLMs' mastery along that dimension. An LLM is considered to truly master a problem only when it excels on the inquiries from all 9 dimensions. By applying CogMath to three benchmarks, we reveal that the mathematical capabilities of 7 mainstream LLMs are overestimated by 30\%-40\%. Moreover, we pinpoint their strengths and weaknesses across specific stages and dimensions, offering in-depth insights for further enhancing their reasoning abilities.
Jiayu Liu, Zhenya Huang, Wei Dai, Cheng Cheng, Jinze Wu, Jing Sha, Song Li, Qi Liu, Shijin Wang, Enhong Chen
Mathematics
Jiayu Liu, Zhenya Huang, Wei Dai, Cheng Cheng, Jinze Wu, Jing Sha, Song Li, Qi Liu, Shijin Wang, Enhong Chen. CogMath: Assessing LLMs' Authentic Mathematical Ability from a Human Cognitive Perspective [EB/OL]. (2025-06-04) [2025-07-16]. https://arxiv.org/abs/2506.04481.
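
For intuition, the following is a minimal Python sketch of how the ``Inquiry-Judge-Reference'' loop and the all-dimension mastery criterion described in the abstract could be wired together. It is an assumption-laden illustration, not the authors' implementation: every identifier (query_llm, inquiry_agent, judge_agent, reference_agent, answer_of) is a hypothetical placeholder, and only three of the nine dimensions are listed, using the examples named in the abstract.

# A minimal sketch (assumptions, not the authors' code) of the
# Inquiry-Judge-Reference evaluation loop from the abstract.
# All identifiers below are hypothetical placeholders.

def query_llm(prompt: str) -> str:
    """Placeholder for a call to an agent LLM; wire in a real API here."""
    raise NotImplementedError

def inquiry_agent(problem: str, dimension: str) -> str:
    # Inquiry: rewrite the seed problem into a probe for one dimension,
    # e.g. perturbing the numbers for "numerical calculation".
    return query_llm(f"Rewrite this problem to test {dimension}:\n{problem}")

def judge_agent(inquiry: str, dimension: str) -> bool:
    # Judge: verify the generated inquiry is valid and on-dimension.
    verdict = query_llm(f"Is this a valid test of {dimension}? Answer yes/no.\n{inquiry}")
    return verdict.strip().lower().startswith("yes")

def reference_agent(inquiry: str) -> str:
    # Reference: produce the reference answer used for grading.
    return query_llm(f"Solve, giving only the final answer:\n{inquiry}")

# Three of the 9 dimensions, using the examples named in the abstract.
DIMENSIONS = ["numerical calculation", "knowledge", "counterfactual"]

def masters_problem(problem: str, answer_of) -> bool:
    """CogMath's criterion: the evaluated model (queried via `answer_of`)
    masters a problem only if it answers the inquiries from every dimension."""
    for dim in DIMENSIONS:
        inquiry = inquiry_agent(problem, dim)
        if not judge_agent(inquiry, dim):
            continue  # in practice: regenerate until the Judge accepts
        if answer_of(inquiry).strip() != reference_agent(inquiry).strip():
            return False
    return True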