
Spectral Policy Optimization: Coloring your Incorrect Reasoning in GRPO


Source: arXiv
Abstract

Reinforcement learning (RL) has demonstrated significant success in enhancing reasoning capabilities in large language models (LLMs). One of the most widely used RL methods is Group Relative Policy Optimization (GRPO)~\cite{Shao-2024-Deepseekmath}, known for its memory efficiency and success in training DeepSeek-R1~\cite{Guo-2025-Deepseek}. However, GRPO stalls when all sampled responses in a group are incorrect -- referred to as an \emph{all-negative-sample} group -- as it fails to update the policy, hindering learning progress. The contributions of this paper are two-fold. First, we propose a simple yet effective framework that introduces response diversity within all-negative-sample groups in GRPO using AI feedback. We also provide a theoretical analysis, via a stylized model, showing how this diversification improves learning dynamics. Second, we empirically validate our approach, showing the improved performance across various model sizes (7B, 14B, 32B) in both offline and online learning settings with 10 benchmarks, including base and distilled variants. Our findings highlight that learning from all-negative-sample groups is not only feasible but beneficial, advancing recent insights from \citet{Xiong-2025-Minimalist}.
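The stall described in the abstract can be seen directly in the group-normalized advantage that GRPO uses. The sketch below is a minimal NumPy illustration, not the authors' implementation; the binary rewards and group size are assumptions made for clarity. When every response in a group receives the same reward (e.g., all incorrect), the centered advantages are all zero and the group contributes no policy-gradient signal.

import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    # Group-relative advantage: normalize each reward by the group's
    # mean and standard deviation (eps guards against division by zero).
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Mixed group: correct (1) and incorrect (0) responses get nonzero advantages.
print(grpo_advantages([1, 0, 0, 1]))   # [ 1. -1. -1.  1.]

# All-negative-sample group: every response is incorrect, all advantages are
# zero, so this group yields no update to the policy.
print(grpo_advantages([0, 0, 0, 0]))   # [0. 0. 0. 0.]

The paper's proposal, as summarized in the abstract, is to introduce response diversity within such all-negative-sample groups via AI feedback so that these groups still provide a learning signal.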

Peter Chen, Xiaopeng Li, Ziniu Li, Xi Chen, Tianyi Lin

Subjects: Computing Technology; Computer Technology

Peter Chen, Xiaopeng Li, Ziniu Li, Xi Chen, Tianyi Lin. Spectral Policy Optimization: Coloring your Incorrect Reasoning in GRPO [EB/OL]. (2025-05-16) [2025-06-30]. https://arxiv.org/abs/2505.11595.
