
REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models

Source: arXiv
Abstract

Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning large language models (LLMs) with human values and preferences. While state-of-the-art applications such as ChatGPT and GPT-4 commonly employ Proximal Policy Optimization (PPO), the critic network it requires introduces significant computational overhead. REINFORCE-based methods, such as REINFORCE Leave-One-Out (RLOO), ReMax, and Group Relative Policy Optimization (GRPO), address this limitation by eliminating the critic network. However, these approaches struggle with accurate advantage estimation: because they estimate advantages independently for the responses to each prompt, they can overfit simpler prompts, become vulnerable to reward hacking, and produce biased estimates. To address these challenges, we introduce REINFORCE++, a novel approach that removes the critic model and instead uses unbiased global advantage normalization to improve training stability. Our empirical evaluation demonstrates that REINFORCE++ exhibits robust performance across various reward models without requiring prompt-set truncation. Furthermore, it achieves superior generalization in both RLHF and long chain-of-thought (CoT) settings compared to existing REINFORCE-based methods. The implementation is available at https://github.com/OpenRLHF/OpenRLHF.
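The key technical change, replacing per-prompt advantage baselines with a single batch-wide (global) normalization, can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the authors' implementation (see the OpenRLHF repository linked above for the released code); the function names and the per-prompt variant shown for contrast are assumptions made here for illustration.

    import torch

    def global_advantage_normalization(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        # Normalize scalar rewards across the entire batch (all prompts together),
        # so every response shares one baseline; this corresponds to the global,
        # unbiased normalization described in the abstract.
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    def per_prompt_advantage_normalization(rewards: torch.Tensor, num_prompts: int, eps: float = 1e-8) -> torch.Tensor:
        # Contrast case (GRPO-style): each prompt's group of sampled responses
        # is normalized with its own mean and std. Assumes rewards are ordered
        # so that consecutive entries belong to the same prompt.
        grouped = rewards.view(num_prompts, -1)
        normalized = (grouped - grouped.mean(dim=1, keepdim=True)) / (grouped.std(dim=1, keepdim=True) + eps)
        return normalized.view(-1)

Estimating the baseline statistics from the whole batch, rather than from the handful of samples drawn for each individual prompt, is where the abstract locates the robustness gain over per-prompt schemes.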

Wei Shen, Jian Hu, Jason Klein Liu, Haotian Xu

Subject areas: Computing Technology; Computer Technology

Wei Shen, Jian Hu, Jason Klein Liu, Haotian Xu. REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models [EB/OL]. (2025-07-28) [2025-08-03]. https://arxiv.org/abs/2501.03262.