
Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach

Source: arXiv
Abstract

Aligning large language models (LLMs) with human preferences usually requires fine-tuning methods such as RLHF and DPO. These methods directly optimize the model parameters, so they cannot be used at test time to improve model performance, nor are they applicable when the model weights are not accessible. In contrast, test-time methods sidestep weight updates by leveraging reward functions to guide and improve output quality. However, they incur high inference costs, and their one-shot guidance is often based on imperfect reward or value functions, leading to suboptimal outputs. In this work, we present a method named Iterative Reweight-then-Optimize (IRO), a reinforcement learning (RL) framework that performs RL-style alignment of the (frozen) base model without touching its parameters. During training, each iteration (i) samples candidates from the base model, (ii) resamples using current value functions, and (iii) trains a new lightweight value function that guides the next decoding pass. At test time, the value functions are used to guide the base model generation via a search-based optimization process. Notably, users can apply IRO to align a model on their own dataset, similar to OpenAI's reinforcement fine-tuning (RFT), but without requiring access to the model weights.
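The following is a minimal, self-contained sketch of the iterative loop the abstract describes: sample from the frozen base model, resample with the current value function, then fit a new lightweight value function used to guide the next pass and, at test time, a search over candidates. All names here (sample_from_base, reward_fn, ValueFunction, guided_decode) and the toy length-based reward are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def sample_from_base(prompt, n):
    """Stand-in for drawing n candidate continuations from the frozen base LLM."""
    filler = ["good", "very", "answer"]
    return [prompt + " " + " ".join(random.choices(filler, k=random.randint(3, 12)))
            for _ in range(n)]

def reward_fn(text):
    """Toy reward that simply prefers longer candidates (hypothetical)."""
    return len(text) + random.gauss(0, 0.1)

class ValueFunction:
    """Lightweight value function: a 1-feature linear model on text length (toy)."""
    def __init__(self, w=0.0, b=0.0):
        self.w, self.b = w, b

    def fit(self, texts, targets):
        # Step (iii): least-squares fit of value ~= w * len(text) + b.
        xs = [len(t) for t in texts]
        n = len(xs)
        mx, my = sum(xs) / n, sum(targets) / n
        var = sum((x - mx) ** 2 for x in xs) or 1.0
        self.w = sum((x - mx) * (y - my) for x, y in zip(xs, targets)) / var
        self.b = my - self.w * mx

    def __call__(self, text):
        return self.w * len(text) + self.b

def reweight_resample(candidates, value_fn, k, beta=1.0):
    # Step (ii): resample candidates with probability proportional to exp(beta * value).
    weights = [math.exp(beta * value_fn(c)) for c in candidates]
    return random.choices(candidates, weights=weights, k=k)

def iro_train(prompt, iterations=3, n_samples=8, k_keep=4):
    value_fn = ValueFunction()  # first iteration resamples uniformly
    for _ in range(iterations):
        candidates = sample_from_base(prompt, n_samples)        # step (i)
        kept = reweight_resample(candidates, value_fn, k_keep)  # step (ii)
        new_value_fn = ValueFunction()
        new_value_fn.fit(kept, [reward_fn(c) for c in kept])    # step (iii)
        value_fn = new_value_fn  # guides the next decoding pass
    return value_fn

def guided_decode(prompt, value_fn, n=16):
    """Test time: search over base-model samples and return the highest-value one."""
    return max(sample_from_base(prompt, n), key=value_fn)

if __name__ == "__main__":
    vf = iro_train("Explain RLHF in one sentence.")
    print(guided_decode("Explain RLHF in one sentence.", vf))
```

In this sketch the base model is never updated; only the small value function is retrained each iteration, mirroring the paper's claim that alignment proceeds without touching the frozen model's parameters.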

Xinnan Zhang, Chenliang Li, Siliang Zeng, Jiaxiang Li, Zhongruo Wang, Kaixiang Lin, Songtao Lu, Alfredo Garcia, Mingyi Hong

Subject: Computing Technology, Computer Technology

Xinnan Zhang, Chenliang Li, Siliang Zeng, Jiaxiang Li, Zhongruo Wang, Kaixiang Lin, Songtao Lu, Alfredo Garcia, Mingyi Hong. Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach [EB/OL]. (2025-07-03) [2025-07-16]. https://arxiv.org/abs/2506.17828.