Libra: Assessing and Improving Reward Model by Learning to Think
Reinforcement learning (RL) has significantly improved the reasoning ability of large language models. However, current reward models underperform in challenging reasoning scenarios, and predominant RL training paradigms rely on rule-based or reference-based rewards, which impose two critical limitations: 1) the dependence on finely annotated reference answers to obtain rewards; and 2) the requirement for a constrained output format. These limitations fundamentally hinder further scaling of RL data and sustained enhancement of model reasoning performance. To address these limitations, we propose a comprehensive framework for evaluating and improving the performance of reward models in complex reasoning scenarios. We first present a reasoning-oriented benchmark (Libra Bench), systematically constructed from a diverse collection of challenging mathematical problems and advanced reasoning models, to address the limitations of existing reward model benchmarks in reasoning scenarios. We further introduce a novel approach for improving the generative reward model via learning-to-think methodologies. Based on the proposed approach, we develop the Libra-RM series, a collection of generative reward models with reasoning capabilities that achieve state-of-the-art results on various benchmarks. Comprehensive downstream experiments are conducted, and the results demonstrate the correlation between our Libra Bench and downstream applications, as well as the potential of Libra-RM to further improve reasoning models with unlabeled data.
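To make the two stated limitations of rule-based rewards concrete, the following is a minimal illustrative sketch (not the paper's code; the function names `extract_boxed_answer` and `rule_based_reward` are hypothetical). It shows how a typical rule-based reward both requires an annotated reference answer and fails whenever the response deviates from the expected output format.

```python
import re
from typing import Optional

def extract_boxed_answer(response: str) -> Optional[str]:
    """Pull the final answer out of a \\boxed{...} span; returns None if the
    model does not follow the constrained output format."""
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    return match.group(1).strip() if match else None

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Binary reward: 1.0 only when the extracted answer exactly matches the
    annotated reference; unusable when no reference answer is available."""
    predicted = extract_boxed_answer(response)
    if predicted is None:  # format violation -> no usable reward signal
        return 0.0
    return 1.0 if predicted == reference_answer.strip() else 0.0

# Both limitations surface immediately:
print(rule_based_reward("The answer is \\boxed{42}.", "42"))  # 1.0
print(rule_based_reward("The answer is 42.", "42"))           # 0.0 (format only)
```

A generative reward model with reasoning capabilities, as pursued in this work, is intended to sidestep both issues by judging free-form responses without requiring an annotated reference.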
Meng Zhou, Bei Li, Jiahao Liu, Xiaowen Shi, Yang Bai, Rongxiang Weng, Jingang Wang, Xunliang Cai
Computing Technology, Computer Technology
Meng Zhou, Bei Li, Jiahao Liu, Xiaowen Shi, Yang Bai, Rongxiang Weng, Jingang Wang, Xunliang Cai. Libra: Assessing and Improving Reward Model by Learning to Think [EB/OL]. (2025-07-29) [2025-08-11]. https://arxiv.org/abs/2507.21645.