Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models

Abstract

Large reasoning models (e.g., R1, o3) have demonstrated remarkable mathematical problem-solving abilities. However, the high reported accuracy of these models on popular datasets, together with reliance on purely numerical evaluation and potential benchmark leakage, often masks their true reasoning shortcomings. To address this, we propose leveraging the inherent rigor and methodological complexity of mathematical proofs as a diagnostic tool to expose these hidden failures. Specifically, we introduce the RFMDataset (Reveal Failure Modes), a collection of 200 diverse mathematical proof problems, and thoroughly evaluate advanced models' performance on it. Our in-depth analysis of their failures uncovers 10 fine-grained error types, which point to fundamental limitations in current large reasoning models: 1) large reasoning models struggle severely with mathematical proofs, with some generating entirely correct proofs for fewer than 20% of problems and failing even on basic ones; 2) models exhibit a diverse spectrum of reasoning failures, most prominently a lack of guarantees for the correctness and rigor of single-step reasoning; and 3) models show hallucination and incompleteness during the reasoning process. Our findings reveal that models' self-reflection is insufficient to resolve these logical failures, necessitating formalized and fine-grained logical training.

Authors: Dadi Guo, Jiayu Liu, Zhiyuan Fan, Zhitao He, Haoran Li, Yumeng Wang, Yi R. Fung

Subject classification: Mathematics

Recommended citation: Dadi Guo, Jiayu Liu, Zhiyuan Fan, Zhitao He, Haoran Li, Yumeng Wang, Yi R. Fung. Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models [EB/OL]. (2025-06-23) [2025-07-21]. https://arxiv.org/abs/2506.17114.

