
REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models

Source: arXiv
Abstract

Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning large language models (LLMs) with human values and preferences. While state-of-the-art applications such as ChatGPT and GPT-4 commonly employ Proximal Policy Optimization (PPO), the critic network it requires introduces significant computational overhead. REINFORCE-based methods, such as REINFORCE Leave-One-Out (RLOO), ReMax, and Group Relative Policy Optimization (GRPO), address this limitation by eliminating the critic network. However, these approaches struggle with accurate advantage estimation: because they estimate advantages independently for the responses to each prompt, they can overfit simpler prompts, become vulnerable to reward hacking, and produce biased estimates. To address these challenges, we introduce REINFORCE++, a novel approach that removes the critic model and instead uses unbiased global advantage normalization to improve training stability. Our empirical evaluation demonstrates that REINFORCE++ exhibits robust performance across various reward models without requiring prompt-set truncation. Furthermore, it achieves superior generalization in both RLHF and long chain-of-thought (CoT) settings compared to existing REINFORCE-based methods. The implementation is available at https://github.com/OpenRLHF/OpenRLHF.
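The key technical change, replacing per-prompt advantage baselines with a single batch-wide (global) normalization, can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the authors' implementation (see the OpenRLHF repository linked above for the released code); the function names and the per-prompt variant shown for contrast are assumptions made here for illustration.

    import torch

    def global_advantage_normalization(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        # Normalize scalar rewards across the entire batch (all prompts together),
        # so every response shares one baseline; this corresponds to the global,
        # unbiased normalization described in the abstract.
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    def per_prompt_advantage_normalization(rewards: torch.Tensor, num_prompts: int, eps: float = 1e-8) -> torch.Tensor:
        # Contrast case (GRPO-style): each prompt's group of sampled responses
        # is normalized with its own mean and std. Assumes rewards are ordered
        # so that consecutive entries belong to the same prompt.
        grouped = rewards.view(num_prompts, -1)
        normalized = (grouped - grouped.mean(dim=1, keepdim=True)) / (grouped.std(dim=1, keepdim=True) + eps)
        return normalized.view(-1)

Estimating the baseline statistics from the whole batch, rather than from the handful of samples drawn for each individual prompt, is where the abstract locates the robustness gain over per-prompt schemes.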

Wei Shen, Jian Hu, Jason Klein Liu, Haotian Xu

Subject areas: Computing Technology; Computer Technology

Wei Shen, Jian Hu, Jason Klein Liu, Haotian Xu. REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models [EB/OL]. (2025-07-28) [2025-08-03]. https://arxiv.org/abs/2501.03262.