LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds
Jailbreak attacks expose vulnerabilities in safety-aligned LLMs by eliciting harmful outputs through carefully crafted prompts. Existing methods rely on discrete optimization or trained adversarial generators, but are slow, compute-intensive, and often impractical. We argue that these inefficiencies stem from a mischaracterization of the problem. Instead, we frame jailbreaks as inference-time misalignment and introduce LIAR (Leveraging Inference-time misAlignment to jailbReak), a fast, black-box, best-of-$N$ sampling attack requiring no training. LIAR matches state-of-the-art success rates while reducing perplexity by $10\times$ and Time-to-Attack from hours to seconds. We also introduce a theoretical "safety net against jailbreaks" metric to quantify safety alignment strength and derive suboptimality bounds. Our work offers a simple yet effective tool for evaluating LLM robustness and advancing alignment research.
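To make the best-of-N idea concrete, the following is a minimal, hypothetical sketch of an inference-time best-of-N sampling attack: draw N candidate prompts from an attacker LM, query the target model with each, score the responses, and keep the best. The model choices (`gpt2` as a stand-in for both attacker and target), the refusal-phrase judge, and the helper names `judge` and `best_of_n` are illustrative assumptions, not the paper's implementation.

```python
# Minimal best-of-N sketch of an inference-time jailbreak attack.
# All model names and the scoring heuristic are assumptions for illustration.
from transformers import pipeline

attacker = pipeline("text-generation", model="gpt2")  # small attacker LM (assumption)
target = pipeline("text-generation", model="gpt2")    # stand-in for the safety-aligned target

REFUSALS = ("I cannot", "I can't", "As an AI", "I'm sorry")

def judge(response: str) -> float:
    """Toy scorer: reward responses without refusal phrases (a real judge would be stronger)."""
    return 0.0 if any(r in response for r in REFUSALS) else 1.0

def best_of_n(query: str, n: int = 8) -> tuple[str, float]:
    """Sample n adversarial continuations of the query, test each on the target, keep the best."""
    best_prompt, best_score = query, float("-inf")
    candidates = attacker(query, num_return_sequences=n, do_sample=True,
                          max_new_tokens=32, pad_token_id=50256)
    for out in candidates:
        prompt = out["generated_text"]  # original query plus sampled adversarial suffix
        response = target(prompt, max_new_tokens=64,
                          pad_token_id=50256)[0]["generated_text"]
        score = judge(response[len(prompt):])  # score only the target's continuation
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score
```

Because each candidate is an independent forward pass with no gradient computation or training loop, the attack's cost is just N inference calls, which is what makes the seconds-scale Time-to-Attack plausible.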
James Beetham, Souradip Chakraborty, Mengdi Wang, Furong Huang, Amrit Singh Bedi, Mubarak Shah
Subjects: Computing Technology, Computer Technology
James Beetham, Souradip Chakraborty, Mengdi Wang, Furong Huang, Amrit Singh Bedi, Mubarak Shah. LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds [EB/OL]. (2025-07-03) [2025-07-22]. https://arxiv.org/abs/2412.05232