
On the Robustness of Reward Models for Language Model Alignment

Source: arXiv
Abstract

The Bradley-Terry (BT) model is widely used in reward modeling for reinforcement learning from human feedback (RLHF). Despite its effectiveness, reward models (RMs) trained with the BT model loss are prone to over-optimization, losing generalizability to unseen input distributions. In this paper, we study the cause of over-optimization in RM training and its downstream effects on the RLHF procedure, highlighting the importance of the distributional robustness of RMs on unseen data. First, we show that excessive dispersion of hidden-state norms is the main source of over-optimization. We then propose batch-wise sum-to-zero regularization (BSR), which enforces a zero-centered reward sum per batch, constraining rewards of extreme magnitude. We assess the impact of BSR on RM robustness across four over-optimization scenarios, in all of which BSR consistently exhibits better robustness. We then compare the plain BT model and BSR in RLHF training and show empirically that robust RMs better align the policy with the gold preference model. Finally, applying BSR to high-quality data and models surpasses state-of-the-art RMs at the 8B scale by more than 5% on complex preference prediction tasks. With RLOO training using 8B RMs, generation length on AlpacaEval 2.0 drops by 40% while the win rate rises by 7%, further highlighting that robustness in RMs induces robustness in RLHF training. We release the code, data, and models: https://github.com/LinkedIn-XFACT/RM-Robustness.
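The abstract only sketches BSR, so a minimal PyTorch sketch of the idea may help: the standard BT pairwise loss plus a squared penalty on the per-batch reward sum, which pulls the batch mean reward toward zero. The function name `bt_loss_with_bsr`, the coefficient `bsr_coef`, and the exact form of the penalty are illustrative assumptions, not the paper's implementation (see the released code for the authors' version).

```python
import torch
import torch.nn.functional as F

def bt_loss_with_bsr(chosen_rewards: torch.Tensor,
                     rejected_rewards: torch.Tensor,
                     bsr_coef: float = 0.01) -> torch.Tensor:
    # Standard Bradley-Terry objective: maximize the log-sigmoid of
    # the reward margin between chosen and rejected responses.
    bt_loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Batch-wise sum-to-zero regularization (BSR): penalize the squared
    # sum of all rewards in the batch so the batch mean stays near zero,
    # discouraging rewards with extreme magnitudes.
    batch_sum = chosen_rewards.sum() + rejected_rewards.sum()
    bsr_penalty = batch_sum.pow(2)

    # bsr_coef is an illustrative weight, not a value from the paper.
    return bt_loss + bsr_coef * bsr_penalty

# Example usage with dummy scalar rewards for a batch of 4 pairs:
chosen = torch.randn(4, requires_grad=True)
rejected = torch.randn(4, requires_grad=True)
loss = bt_loss_with_bsr(chosen, rejected)
loss.backward()
```

Because the penalty acts on the batch sum rather than on individual rewards, it leaves the pairwise margins that the BT loss cares about untouched while bounding the overall reward scale, which is the robustness mechanism the abstract describes.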

Jiwoo Hong, Noah Lee, Eunki Kim, Guijin Son, Woojin Chung, Aman Gupta, Shao Tang, James Thorne

Subjects: Computing Technology, Computer Technology

Jiwoo Hong, Noah Lee, Eunki Kim, Guijin Son, Woojin Chung, Aman Gupta, Shao Tang, James Thorne. On the Robustness of Reward Models for Language Model Alignment [EB/OL]. (2025-05-12) [2025-06-19]. https://arxiv.org/abs/2505.07271.
