
Predictive Scaling Laws for Efficient GRPO Training of Large Reasoning Models

Source: arXiv
Abstract

Fine-tuning large language models (LLMs) for reasoning tasks with reinforcement learning methods such as Group Relative Policy Optimization (GRPO) is computationally expensive. To address this, we propose a predictive framework that models training dynamics and helps optimize resource usage. Through experiments on Llama and Qwen models (3B–8B), we derive an empirical scaling law based on model size, initial performance, and training progress. This law predicts reward trajectories and identifies three consistent training phases: slow start, rapid improvement, and plateau. We find that training beyond a certain number of epochs offers little gain, suggesting that earlier stopping can significantly reduce compute without sacrificing performance. Our approach generalizes across model types, providing a practical guide for efficient GRPO-based fine-tuning.
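The abstract does not state the functional form of the scaling law, but the three phases it describes (slow start, rapid improvement, plateau) are the shape of a logistic curve. The sketch below is an illustrative assumption of how such a reward trajectory could be fit from GRPO training logs to estimate where the plateau begins; the parameter names, the logistic form, and the synthetic data are hypothetical, not the authors' published formula.

# Hypothetical sketch: fitting a logistic reward curve to GRPO training logs.
# The logistic form and all parameter names here are illustrative assumptions;
# the paper's actual scaling law is not given in the abstract.
import numpy as np
from scipy.optimize import curve_fit

def reward_curve(progress, r0, r_max, k, t_mid):
    """Logistic reward trajectory over normalized training progress in [0, 1].

    r0     -- initial reward (performance before fine-tuning)
    r_max  -- plateau reward
    k      -- steepness of the rapid-improvement phase
    t_mid  -- progress at which improvement is fastest
    """
    return r0 + (r_max - r0) / (1.0 + np.exp(-k * (progress - t_mid)))

# Synthetic example: a noisy reward log over a single training run.
progress = np.linspace(0.0, 1.0, 200)
true_params = (0.15, 0.72, 12.0, 0.35)
rewards = reward_curve(progress, *true_params) + np.random.normal(0.0, 0.02, progress.size)

# Fit the curve; the estimated plateau r_max and midpoint t_mid indicate
# roughly where additional training stops paying off.
params, _ = curve_fit(reward_curve, progress, rewards, p0=(0.1, 0.8, 10.0, 0.5))
r0_hat, r_max_hat, k_hat, t_mid_hat = params
print(f"estimated plateau reward: {r_max_hat:.3f}")
print(f"fastest improvement at progress: {t_mid_hat:.2f}")

Under these assumptions, an early-stopping rule could trigger once the measured reward comes within a small margin of the fitted plateau, which is the kind of compute saving the abstract points to.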

Datta Nimmaturi, Vaishnavi Bhargava, Rajat Ghosh, Johnu George, Debojyoti Dutta

Computing Technology, Computer Technology

Datta Nimmaturi, Vaishnavi Bhargava, Rajat Ghosh, Johnu George, Debojyoti Dutta. Predictive Scaling Laws for Efficient GRPO Training of Large Reasoning Models [EB/OL]. (2025-07-24) [2025-08-10]. https://arxiv.org/abs/2507.18014.
