Synthesis by Design: Controlled Data Generation via Structural Guidance

Source: arXiv
Abstract

Mathematical reasoning remains challenging for LLMs due to complex logic and the need for precise computation. Existing methods enhance LLM reasoning by synthesizing datasets through problem rephrasing, but face issues with generation quality and problem complexity. To address this, we propose extracting structural information from mathematical reasoning by generating problem-solving code, and then guiding data generation with the resulting structured solutions. Applied to MATH and GSM8K, our approach produces 39K problems with labeled intermediate steps and a 6.1K-problem benchmark of higher difficulty. Results on our benchmark show that model performance declines as reasoning length increases. Additionally, we conducted fine-tuning experiments using the proposed training data on a range of LLMs, and the results validate the effectiveness of our dataset. We hope the proposed method and dataset will contribute to future research in enhancing LLM reasoning capabilities. Our code and data are available at https://github.com/OpenCausaLab/StructuralGeneration.

Lei Xu, Sirui Chen, Yuxuan Huang, Chaochao Lu

Mathematics

Lei Xu, Sirui Chen, Yuxuan Huang, Chaochao Lu. Synthesis by Design: Controlled Data Generation via Structural Guidance [EB/OL]. (2025-06-09) [2025-07-16]. https://arxiv.org/abs/2506.07664.