LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds
Jailbreak attacks expose vulnerabilities in safety-aligned LLMs by eliciting harmful outputs through carefully crafted prompts. Existing methods rely on discrete optimization or trained adversarial generators, but are slow, compute-intensive, and often impractical. We argue that these inefficiencies stem from a mischaracterization of the problem. Instead, we frame jailbreaks as inference-time misalignment and introduce LIAR (Leveraging Inference-time misAlignment to jailbReak), a fast, black-box, best-of-$N$ sampling attack requiring no training. LIAR matches state-of-the-art success rates while reducing perplexity by $10\times$ and Time-to-Attack from hours to seconds. We also introduce a theoretical "safety net against jailbreaks" metric to quantify safety alignment strength and derive suboptimality bounds. Our work offers a simple yet effective tool for evaluating LLM robustness and advancing alignment research.
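To make the best-of-N idea concrete, the following is a minimal, hypothetical sketch of an inference-time best-of-N sampling attack: draw N candidate prompts from an attacker LM, query the target model with each, score the responses, and keep the best. The model choices (`gpt2` as a stand-in for both attacker and target), the refusal-phrase judge, and the helper names `judge` and `best_of_n` are illustrative assumptions, not the paper's implementation.

```python
# Minimal best-of-N sketch of an inference-time jailbreak attack.
# All model names and the scoring heuristic are assumptions for illustration.
from transformers import pipeline

attacker = pipeline("text-generation", model="gpt2")  # small attacker LM (assumption)
target = pipeline("text-generation", model="gpt2")    # stand-in for the safety-aligned target

REFUSALS = ("I cannot", "I can't", "As an AI", "I'm sorry")

def judge(response: str) -> float:
    """Toy scorer: reward responses without refusal phrases (a real judge would be stronger)."""
    return 0.0 if any(r in response for r in REFUSALS) else 1.0

def best_of_n(query: str, n: int = 8) -> tuple[str, float]:
    """Sample n adversarial continuations of the query, test each on the target, keep the best."""
    best_prompt, best_score = query, float("-inf")
    candidates = attacker(query, num_return_sequences=n, do_sample=True,
                          max_new_tokens=32, pad_token_id=50256)
    for out in candidates:
        prompt = out["generated_text"]  # original query plus sampled adversarial suffix
        response = target(prompt, max_new_tokens=64,
                          pad_token_id=50256)[0]["generated_text"]
        score = judge(response[len(prompt):])  # score only the target's continuation
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score
```

Because each candidate is an independent forward pass with no gradient computation or training loop, the attack's cost is just N inference calls, which is what makes the seconds-scale Time-to-Attack plausible.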
James Beetham, Souradip Chakraborty, Mengdi Wang, Furong Huang, Amrit Singh Bedi, Mubarak Shah
Subjects: Computing Technology, Computer Technology
James Beetham, Souradip Chakraborty, Mengdi Wang, Furong Huang, Amrit Singh Bedi, Mubarak Shah. LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds [EB/OL]. (2025-07-03) [2025-07-22]. https://arxiv.org/abs/2412.05232