Doubly Robust Alignment for Large Language Models

Abstract

This paper studies reinforcement learning from human feedback (RLHF) for aligning large language models with human preferences. While RLHF has demonstrated promising results, many algorithms are highly sensitive to misspecifications in the underlying preference model (e.g., the Bradley-Terry model), the reference policy, or the reward function, resulting in undesirable fine-tuning. To address model misspecification, we propose a doubly robust preference optimization algorithm that remains consistent when either the preference model or the reference policy is correctly specified (without requiring both). Our proposal demonstrates superior and more robust performance than state-of-the-art algorithms, both in theory and in practice. The code is available at https://github.com/DRPO4LLM/DRPO4LLM

Authors: Erhan Xu, Kai Ye, Hongyi Zhou, Luhan Zhu, Francesco Quinzan, Chengchun Shi

Subject classification: Computing technology, computer technology

Recommended citation: Erhan Xu, Kai Ye, Hongyi Zhou, Luhan Zhu, Francesco Quinzan, Chengchun Shi. Doubly Robust Alignment for Large Language Models [EB/OL]. (2025-06-01) [2025-07-21]. https://arxiv.org/abs/2506.01183.
