Reward Model Interpretability via Optimal and Pessimal Tokens
Reward modeling has emerged as a crucial component in aligning large language models with human values. Significant attention has focused on using reward models as a means for fine-tuning generative models. However, the reward models themselves -- which directly encode human value judgments by turning prompt-response pairs into scalar rewards -- remain relatively understudied. We present a novel approach to reward model interpretability through exhaustive analysis of their responses across their entire vocabulary space. By examining how different reward models score every possible single-token response to value-laden prompts, we uncover several striking findings: (i) substantial heterogeneity between models trained on similar objectives, (ii) systematic asymmetries in how models encode high- vs low-scoring tokens, (iii) significant sensitivity to prompt framing that mirrors human cognitive biases, and (iv) overvaluation of more frequent tokens. We demonstrate these effects across ten recent open-source reward models of varying parameter counts and architectures. Our results challenge assumptions about the interchangeability of reward models, as well as their suitability as proxies of complex and context-dependent human values. We find that these models can encode concerning biases toward certain identity groups, which may emerge as unintended consequences of harmlessness training -- distortions that risk propagating through the downstream large language models now deployed to millions.
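To make the exhaustive single-token scoring procedure described above concrete, the following is a minimal sketch (not the authors' released code) of how one might score every token in a reward model's vocabulary as a candidate response to a fixed prompt. It assumes a Hugging Face sequence-classification-style reward model; the model name and prompt are illustrative choices, not taken from the paper.

```python
# Sketch: score every single-token response to a value-laden prompt with a
# sequence-classification reward model (illustrative, not the paper's code).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example reward model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

prompt = "What do you value most in life?"  # hypothetical value-laden prompt
scores = {}

with torch.no_grad():
    for token_id in range(tokenizer.vocab_size):
        # Decode each vocabulary item and treat it as the entire response.
        response = tokenizer.decode([token_id])
        inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
        # The scalar reward is read off the single classification logit.
        scores[response] = model(**inputs).logits[0, 0].item()

# "Optimal" and "pessimal" tokens: highest- and lowest-scoring single-token responses.
ranked = sorted(scores.items(), key=lambda kv: kv[1])
print("pessimal tokens:", ranked[:10])
print("optimal tokens:", ranked[-10:])
```

Scoring the full vocabulary one token at a time is slow; in practice one would batch the forward passes, but the loop above keeps the core idea visible: the reward model induces a complete ranking over single-token responses, which can then be compared across models and prompt framings.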
Brian Christian, Hannah Rose Kirk, Jessica A. F. Thompson, Christopher Summerfield, Tsvetomira Dumbalska
Subject: Computing Technology; Computer Technology
Brian Christian, Hannah Rose Kirk, Jessica A. F. Thompson, Christopher Summerfield, Tsvetomira Dumbalska. Reward Model Interpretability via Optimal and Pessimal Tokens [EB/OL]. (2025-06-08) [2025-06-27]. https://arxiv.org/abs/2506.07326.