Risk-aware Direct Preference Optimization under Nested Risk Measure
When fine-tuning pre-trained Large Language Models (LLMs) to align with human values and intentions, maximizing the estimated reward can lead to superior performance, but it also introduces potential risks due to deviations from the reference model's intended behavior. Most existing methods introduce a KL-divergence term to constrain the deviation between the trained model and the reference model; however, this may not be sufficient in applications that require tight risk control. In this paper, we introduce Risk-aware Direct Preference Optimization (Ra-DPO), a novel approach that incorporates risk-awareness through a class of nested risk measures. Ra-DPO formulates a constrained risk-aware advantage function maximization problem and then converts the Bradley-Terry model into a token-level representation. The resulting objective maximizes the likelihood of the policy while suppressing the deviation between the trained model and the reference model via a sequential risk ratio, thereby enhancing the model's risk-awareness. Experimental results on three open-source datasets (IMDb, Anthropic HH, and AlpacaEval) demonstrate the proposed method's superior performance in balancing alignment quality and model drift. Our code is open-sourced at https://github.com/zlj123-max/Ra-DPO.
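To make the abstract's description more concrete, the sketch below shows a token-level DPO-style preference loss augmented with a risk-sensitive deviation penalty. It is not the paper's Ra-DPO objective: the nested risk measure and the sequential risk ratio are only paraphrased, and the CVaR-like tail penalty, `beta`, `risk_weight`, and `alpha` are illustrative assumptions.

```python
# Schematic sketch only: a token-level DPO-style loss plus an assumed
# risk-sensitive penalty on deviations from the reference model.
# This is NOT the authors' exact Ra-DPO formulation.
import torch
import torch.nn.functional as F


def risk_aware_dpo_loss(logp_chosen, logp_ref_chosen,
                        logp_rejected, logp_ref_rejected,
                        beta=0.1, risk_weight=0.1, alpha=0.9):
    """Per-token log-probs have shape (batch, seq_len); padding masked to 0."""
    # Sequence-level log-ratios accumulated over tokens (token-level Bradley-Terry).
    ratio_chosen = (logp_chosen - logp_ref_chosen).sum(dim=-1)
    ratio_rejected = (logp_rejected - logp_ref_rejected).sum(dim=-1)

    # Standard DPO preference term: maximize likelihood of preferred responses.
    pref_loss = -F.logsigmoid(beta * (ratio_chosen - ratio_rejected)).mean()

    # Illustrative risk penalty (an assumption, not the paper's nested risk
    # measure): penalize the worst (1 - alpha) fraction of per-token deviations
    # from the reference model, a CVaR-like surrogate for "suppressing
    # deviation" in a risk-aware way.
    per_token_dev = (logp_chosen - logp_ref_chosen).abs()
    tail_threshold = torch.quantile(per_token_dev, alpha)
    risk_penalty = per_token_dev[per_token_dev >= tail_threshold].mean()

    return pref_loss + risk_weight * risk_penalty
```

For the authors' actual objective and the definition of the sequential risk ratio, see the paper and the released code at https://github.com/zlj123-max/Ra-DPO.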
Lijun Zhang, Lin Li, Yajie Qi, Huizhong Song, Yaodong Yang, Jun Wang, Wei Wei
Subject: Computing Technology; Computer Technology
Lijun Zhang, Lin Li, Yajie Qi, Huizhong Song, Yaodong Yang, Jun Wang, Wei Wei. Risk-aware Direct Preference Optimization under Nested Risk Measure [EB/OL]. (2025-05-26) [2025-07-25]. https://arxiv.org/abs/2505.20359.