Risk-aware Direct Preference Optimization under Nested Risk Measure
When fine-tuning pre-trained Large Language Models (LLMs) to align with human values and intentions, maximizing the estimated reward can lead to superior performance, but it also introduces potential risks due to deviations from the reference model's intended behavior. Most existing methods introduce a KL-divergence term to constrain the deviation between the trained model and the reference model; however, this may not be sufficient in applications that require tight risk control. In this paper, we introduce Risk-aware Direct Preference Optimization (Ra-DPO), a novel approach that incorporates risk-awareness through a class of nested risk measures. Ra-DPO formulates a constrained risk-aware advantage function maximization problem and then converts the Bradley-Terry model into a token-level representation. The resulting objective maximizes the likelihood of the policy while suppressing the deviation between the trained model and the reference model via a sequential risk ratio, thereby enhancing the model's risk-awareness. Experimental results on three open-source datasets (IMDb, Anthropic HH, and AlpacaEval) demonstrate the proposed method's superior performance in balancing alignment quality and model drift. Our code is open-sourced at https://github.com/zlj123-max/Ra-DPO.
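To make the abstract's description more concrete, the sketch below shows a token-level DPO-style preference loss augmented with a risk-sensitive deviation penalty. It is not the paper's Ra-DPO objective: the nested risk measure and the sequential risk ratio are only paraphrased, and the CVaR-like tail penalty, `beta`, `risk_weight`, and `alpha` are illustrative assumptions.

```python
# Schematic sketch only: a token-level DPO-style loss plus an assumed
# risk-sensitive penalty on deviations from the reference model.
# This is NOT the authors' exact Ra-DPO formulation.
import torch
import torch.nn.functional as F


def risk_aware_dpo_loss(logp_chosen, logp_ref_chosen,
                        logp_rejected, logp_ref_rejected,
                        beta=0.1, risk_weight=0.1, alpha=0.9):
    """Per-token log-probs have shape (batch, seq_len); padding masked to 0."""
    # Sequence-level log-ratios accumulated over tokens (token-level Bradley-Terry).
    ratio_chosen = (logp_chosen - logp_ref_chosen).sum(dim=-1)
    ratio_rejected = (logp_rejected - logp_ref_rejected).sum(dim=-1)

    # Standard DPO preference term: maximize likelihood of preferred responses.
    pref_loss = -F.logsigmoid(beta * (ratio_chosen - ratio_rejected)).mean()

    # Illustrative risk penalty (an assumption, not the paper's nested risk
    # measure): penalize the worst (1 - alpha) fraction of per-token deviations
    # from the reference model, a CVaR-like surrogate for "suppressing
    # deviation" in a risk-aware way.
    per_token_dev = (logp_chosen - logp_ref_chosen).abs()
    tail_threshold = torch.quantile(per_token_dev, alpha)
    risk_penalty = per_token_dev[per_token_dev >= tail_threshold].mean()

    return pref_loss + risk_weight * risk_penalty
```

For the authors' actual objective and the definition of the sequential risk ratio, see the paper and the released code at https://github.com/zlj123-max/Ra-DPO.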
Lijun Zhang, Lin Li, Yajie Qi, Huizhong Song, Yaodong Yang, Jun Wang, Wei Wei
Subject: Computing Technology; Computer Technology
Lijun Zhang, Lin Li, Yajie Qi, Huizhong Song, Yaodong Yang, Jun Wang, Wei Wei. Risk-aware Direct Preference Optimization under Nested Risk Measure [EB/OL]. (2025-05-26) [2025-07-25]. https://arxiv.org/abs/2505.20359.