
Bridging Offline and Online Reinforcement Learning for LLMs

Source: arXiv

Abstract

We investigate the effectiveness of reinforcement learning methods for finetuning large language models when transitioning from offline to semi-online to fully online regimes for both verifiable and non-verifiable tasks. Our experiments cover training on verifiable math as well as non-verifiable instruction following with a set of benchmark evaluations for both. Across these settings, we extensively compare online and semi-online Direct Preference Optimization and Group Reward Policy Optimization objectives, and surprisingly find similar performance and convergence between these variants, which all strongly outperform offline methods. We provide a detailed analysis of the training dynamics and hyperparameter selection strategies to achieve optimal results. Finally, we show that multi-tasking with verifiable and non-verifiable rewards jointly yields improved performance across both task types.
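
For context (not part of the original abstract), the offline baseline compared here is commonly the standard DPO objective of Rafailov et al. (2023); a reference form, assuming that standard formulation rather than any paper-specific variant, is

\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

where $y_w$ and $y_l$ are the chosen and rejected responses for prompt $x$, $\pi_{\mathrm{ref}}$ is a frozen reference policy, $\beta$ controls deviation from it, and $\sigma$ is the logistic function. Judging from the abstract, the semi-online and fully online regimes differ from this offline setting in that the preference data is refreshed from the current policy $\pi_\theta$ during training rather than drawn once from a fixed dataset $\mathcal{D}$.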

Jack Lanchantin, Angelica Chen, Janice Lan, Xian Li, Swarnadeep Saha, Tianlu Wang, Jing Xu, Ping Yu, Weizhe Yuan, Jason E Weston, Sainbayar Sukhbaatar, Ilia Kulikov

Computing technology; computer technology

Jack Lanchantin, Angelica Chen, Janice Lan, Xian Li, Swarnadeep Saha, Tianlu Wang, Jing Xu, Ping Yu, Weizhe Yuan, Jason E Weston, Sainbayar Sukhbaatar, Ilia Kulikov. Bridging Offline and Online Reinforcement Learning for LLMs [EB/OL]. (2025-06-26) [2025-08-02]. https://arxiv.org/abs/2506.21495.
