
Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models


Source: arXiv
English Abstract

Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which a previous response in the dialogue can steer the model's subsequent behavior toward policy-violating content. Building on this insight, we propose Response Attack (RA), which uses an auxiliary LLM to generate a mildly harmful response to a paraphrased version of the original malicious query. This fabricated response is then formatted into the dialogue history and followed by a succinct trigger prompt, thereby priming the target model to generate harmful content. Across eight open-source and proprietary LLMs, RA consistently outperforms seven state-of-the-art jailbreak techniques, achieving higher attack success rates. To mitigate this threat, we construct and release a context-aware safety fine-tuning dataset, which significantly reduces the attack success rate while preserving model capabilities. The code and data are available at https://github.com/Dtc7w3PQ/Response-Attack.
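To make the described pipeline concrete, below is a minimal Python sketch of how the primed dialogue might be assembled. This is an illustrative reconstruction from the abstract alone: the function name, argument names, and the example trigger text are assumptions, not the paper's actual implementation (see the linked repository for that).

```python
# Minimal sketch of Response Attack's dialogue construction, assuming a
# standard chat-message format. The auxiliary LLM's mildly harmful answer
# is injected as a prior assistant turn; a short trigger prompt then asks
# the target model to continue, exploiting contextual priming.
# All names and the trigger text here are hypothetical.

def build_attack_dialogue(paraphrased_query: str,
                          priming_response: str,
                          trigger_prompt: str) -> list[dict]:
    """Return a chat history that primes the target model."""
    return [
        # Paraphrase of the original malicious query, produced beforehand.
        {"role": "user", "content": paraphrased_query},
        # Mildly harmful response generated by the auxiliary LLM.
        {"role": "assistant", "content": priming_response},
        # Succinct trigger prompt, e.g. a request to elaborate in detail.
        {"role": "user", "content": trigger_prompt},
    ]
```

In an actual attack, this message list would be sent to the target model's chat endpoint, so the model treats the injected assistant turn as its own prior output and is biased toward continuing in the same policy-violating direction.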

Ziqi Miao, Lijun Li, Yuan Xiong, Zhenhua Liu, Pengyu Zhu, Jing Shao

Subjects: Computing Technology; Computer Technology

Ziqi Miao, Lijun Li, Yuan Xiong, Zhenhua Liu, Pengyu Zhu, Jing Shao. Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models [EB/OL]. (2025-07-07) [2025-07-16]. https://arxiv.org/abs/2507.05248.
