Derailing Non-Answers via Logit Suppression at Output Subspace Boundaries in RLHF-Aligned Language Models

Abstract

We introduce a method to reduce refusal rates of large language models (LLMs) on sensitive content without modifying model weights or prompts. Motivated by the observation that refusals in certain models were often preceded by a specific token sequence, namely the token that opens the chain-of-thought (CoT) block (<think>) followed by a double-newline token (\n\n), we investigate the impact of two simple formatting adjustments during generation: suppressing \n\n after <think> and suppressing the end-of-sequence token after the end of the CoT block (</think>). Our method requires no datasets, parameter changes, or training; it relies solely on modifying token probabilities during generation. In our experiments with official DeepSeek-R1 distillations, these interventions increased the proportion of substantive answers to sensitive prompts without affecting performance on standard benchmarks. Our findings suggest that refusal behaviors can be circumvented by blocking refusal subspaces at specific points in the generation process.
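The two interventions described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the token ids, the function names `suppress_at_boundaries` and `greedy_pick`, and the reading of "after" as "at the immediately following position" are all assumptions made for illustration. The idea is simply to set a token's logit to negative infinity at a specific boundary so it cannot be sampled there.

```python
import math

# Hypothetical token ids for a toy 5-token vocabulary.
# Real ids depend on the model's tokenizer and are not given in the abstract.
THINK_OPEN, THINK_CLOSE, DOUBLE_NEWLINE, EOS, OTHER = 0, 1, 2, 3, 4


def suppress_at_boundaries(prev_token, logits):
    """Return a copy of `logits` with one token blocked at each boundary.

    Assumption: "after <think>" and "after </think>" mean the position
    immediately following that token.
    """
    out = list(logits)
    if prev_token == THINK_OPEN:
        # Block the double newline right after the CoT block opens.
        out[DOUBLE_NEWLINE] = -math.inf
    elif prev_token == THINK_CLOSE:
        # Block end-of-sequence right after the CoT block closes.
        out[EOS] = -math.inf
    return out


def greedy_pick(logits):
    """Index of the largest logit (greedy decoding for one step)."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy logits in which the double newline would otherwise win.
step_logits = [0.0, 0.1, 3.0, 0.2, 1.0]
unmodified = greedy_pick(step_logits)
modified = greedy_pick(suppress_at_boundaries(THINK_OPEN, step_logits))
```

In a real decoding loop, a function like this would be applied to the model's logits at every step before sampling; weights and prompts stay untouched, which matches the abstract's claim that only token probabilities during generation are modified.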

Authors: Harvey Dam, Jonas Knochelmann, Vinu Joseph, Ganesh Gopalakrishnan

Subject classification: Computing technology, computer technology

Recommended citation: Harvey Dam, Jonas Knochelmann, Vinu Joseph, Ganesh Gopalakrishnan. Derailing Non-Answers via Logit Suppression at Output Subspace Boundaries in RLHF-Aligned Language Models [EB/OL]. (2025-05-28) [2025-06-19]. https://arxiv.org/abs/2505.23848.
