ShiQ: Bringing back Bellman to LLMs
The fine-tuning of pre-trained large language models (LLMs) using reinforcement learning (RL) is generally formulated as direct policy optimization. This approach was naturally favored because it efficiently improves a pre-trained LLM, which is treated as an initial policy. Another RL paradigm, Q-learning, has received far less attention in the LLM community despite major successes in various non-LLM RL tasks. In particular, the effectiveness of Q-learning comes from its sample efficiency and its ability to learn offline, which is particularly valuable given the high computational cost of sampling with LLMs. However, naively applying a Q-learning-style update to the model's logits is ineffective because of characteristics specific to LLMs. Our core contribution is to derive theoretically grounded loss functions from Bellman equations to adapt Q-learning methods to LLMs. To do so, we carefully adapt insights from the RL literature to account for these LLM-specific characteristics, ensuring that the logits become reliable Q-value estimates. We then use this loss to build a practical algorithm, ShiQ (for Shifted-Q), that supports off-policy, token-wise learning while remaining simple to implement. Finally, we evaluate ShiQ on both synthetic data and real-world benchmarks, e.g., UltraFeedback and BFCL-V3, demonstrating its effectiveness in both single-turn and multi-turn LLM settings.
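For readers less familiar with the Q-learning paradigm the abstract refers to, the following is a minimal sketch of a textbook tabular Q-learning update applied to a fixed, pre-collected batch of transitions (the offline setting). It is not the ShiQ loss, which the abstract does not spell out. The function name `q_learning_offline`, the two-state toy problem, and the hyperparameters are all assumptions chosen for illustration.

```python
import random


def q_learning_offline(transitions, n_states, n_actions,
                       gamma=0.9, lr=0.5, epochs=200, seed=0):
    """Tabular Q-learning over a fixed batch of (s, a, r, s_next, done) tuples.

    Hypothetical helper for illustration only; this is the generic
    Bellman-backup update, not the loss derived in the paper.
    """
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(epochs):
        batch = list(transitions)
        rng.shuffle(batch)  # replay the same offline data in a random order
        for s, a, r, s_next, done in batch:
            # Bellman optimality target: reward plus discounted best next value.
            target = r if done else r + gamma * max(q[s_next])
            # Move the current estimate toward the target.
            q[s][a] += lr * (target - q[s][a])
    return q


# Toy problem (assumed): two states, two actions.
# Action 1 moves forward; taking it in state 1 ends the episode with reward 1.
# Action 0 stays in place with reward 0.
transitions = [
    (0, 0, 0.0, 0, False),
    (0, 1, 0.0, 1, False),
    (1, 0, 0.0, 1, False),
    (1, 1, 1.0, 1, True),  # s_next is ignored when done is True
]

q = q_learning_offline(transitions, n_states=2, n_actions=2)
greedy = [max(range(2), key=lambda a: q[s][a]) for s in range(2)]
```

The update never samples new trajectories: it only replays the stored transitions, which is the property the abstract highlights as valuable when sampling from an LLM is expensive. In this toy problem the learned values approach the fixed point of the Bellman equation, and the greedy policy picks the forward action in both states.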
Pierre Clavier, Nathan Grinsztajn, Raphael Avalos, Yannis Flet-Berliac, Irem Ergun, Omar D. Domingues, Eugene Tarassov, Olivier Pietquin, Pierre H. Richemond, Florian Strub, Matthieu Geist
Computing technology, computer technology
Pierre Clavier, Nathan Grinsztajn, Raphael Avalos, Yannis Flet-Berliac, Irem Ergun, Omar D. Domingues, Eugene Tarassov, Olivier Pietquin, Pierre H. Richemond, Florian Strub, Matthieu Geist. ShiQ: Bringing back Bellman to LLMs [EB/OL]. (2025-05-16) [2025-06-05]. https://arxiv.org/abs/2505.11081.