
Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models


Source: arXiv
English Abstract

Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which a previous response in the dialogue can steer the model's subsequent behavior toward policy-violating content. Building on this insight, we propose Response Attack (RA), which uses an auxiliary LLM to generate a mildly harmful response to a paraphrased version of the original malicious query. This fabricated response is then formatted into the dialogue history and followed by a succinct trigger prompt, thereby priming the target model to generate harmful content. Across eight open-source and proprietary LLMs, RA consistently outperforms seven state-of-the-art jailbreak techniques, achieving higher attack success rates. To mitigate this threat, we construct and release a context-aware safety fine-tuning dataset, which significantly reduces the attack success rate while preserving model capabilities. The code and data are available at https://github.com/Dtc7w3PQ/Response-Attack.
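To make the described pipeline concrete, below is a minimal Python sketch of how the primed dialogue might be assembled. This is an illustrative reconstruction from the abstract alone: the function name, argument names, and the example trigger text are assumptions, not the paper's actual implementation (see the linked repository for that).

```python
# Minimal sketch of Response Attack's dialogue construction, assuming a
# standard chat-message format. The auxiliary LLM's mildly harmful answer
# is injected as a prior assistant turn; a short trigger prompt then asks
# the target model to continue, exploiting contextual priming.
# All names and the trigger text here are hypothetical.

def build_attack_dialogue(paraphrased_query: str,
                          priming_response: str,
                          trigger_prompt: str) -> list[dict]:
    """Return a chat history that primes the target model."""
    return [
        # Paraphrase of the original malicious query, produced beforehand.
        {"role": "user", "content": paraphrased_query},
        # Mildly harmful response generated by the auxiliary LLM.
        {"role": "assistant", "content": priming_response},
        # Succinct trigger prompt, e.g. a request to elaborate in detail.
        {"role": "user", "content": trigger_prompt},
    ]
```

In an actual attack, this message list would be sent to the target model's chat endpoint, so the model treats the injected assistant turn as its own prior output and is biased toward continuing in the same policy-violating direction.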

Ziqi Miao, Lijun Li, Yuan Xiong, Zhenhua Liu, Pengyu Zhu, Jing Shao

Subjects: Computing Technology; Computer Technology

Ziqi Miao, Lijun Li, Yuan Xiong, Zhenhua Liu, Pengyu Zhu, Jing Shao. Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models [EB/OL]. (2025-07-07) [2025-07-16]. https://arxiv.org/abs/2507.05248.
