LLM Jailbreak Oracle
As large language models (LLMs) become increasingly deployed in safety-critical applications, the lack of systematic methods to assess their vulnerability to jailbreak attacks presents a critical security gap. We introduce the jailbreak oracle problem: given a model, prompt, and decoding strategy, determine whether a jailbreak response can be generated with likelihood exceeding a specified threshold. This formalization enables a principled study of jailbreak vulnerabilities. Answering the jailbreak oracle problem poses significant computational challenges: the search space grows exponentially with the length of the response. We present Boa, the first efficient algorithm for solving the jailbreak oracle problem. Boa employs a three-phase search strategy: (1) constructing block lists to identify refusal patterns, (2) breadth-first sampling to surface easily accessible jailbreaks, and (3) depth-first priority search guided by fine-grained safety scores to systematically explore promising low-probability paths. Boa enables rigorous security assessments including systematic defense evaluation, standardized comparison of red team attacks, and model certification under extreme adversarial conditions.
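To make the formalization concrete, here is a minimal, self-contained Python sketch of the oracle as a decision procedure, paired with a best-first priority search over decoding paths in the spirit of phase (3). Everything in it is illustrative rather than taken from the paper: ToyLM, is_jailbreak, and the uniform toy distribution are hypothetical stand-ins for the target model, the safety judge, and the decoding strategy.

```python
import heapq
import math
from typing import Callable, List, Tuple


class ToyLM:
    """Stand-in for an LLM: maps a prompt and token prefix to a next-token distribution."""

    VOCAB = ["I", "cannot", "help", "Sure", "here", "is", "<eos>"]

    def next_token_probs(self, prompt: str, prefix: List[str]) -> List[Tuple[str, float]]:
        # Uniform toy distribution; a real model would return a softmax over
        # its vocabulary, filtered by the decoding strategy under test.
        p = 1.0 / len(self.VOCAB)
        return [(tok, p) for tok in self.VOCAB]


def jailbreak_oracle(
    model: ToyLM,
    prompt: str,
    is_jailbreak: Callable[[List[str]], bool],  # safety judge over a full response
    log_threshold: float,                       # log of the likelihood threshold
    max_len: int = 8,
) -> bool:
    """Return True iff some response whose log-probability meets the
    threshold is judged a jailbreak, exploring likelier paths first."""
    # Max-heap on cumulative log-probability (negated for heapq's min-heap).
    frontier: List[Tuple[float, List[str]]] = [(0.0, [])]
    while frontier:
        neg_logp, prefix = heapq.heappop(frontier)
        logp = -neg_logp
        if logp < log_threshold:
            continue  # extending a path only lowers log-prob, so prune it
        if prefix and (prefix[-1] == "<eos>" or len(prefix) >= max_len):
            if is_jailbreak(prefix):
                return True  # witness found: the oracle answers "yes"
            continue
        for tok, p in model.next_token_probs(prompt, prefix):
            heapq.heappush(frontier, (-(logp + math.log(p)), prefix + [tok]))
    return False  # no qualifying response above the threshold


if __name__ == "__main__":
    # Hypothetical judge: flags any response containing this toy phrase.
    judge = lambda resp: "Sure" in resp and "here" in resp
    print(jailbreak_oracle(ToyLM(), "toy harmful prompt", judge, math.log(1e-6)))
```

The pruning step is sound because extending a prefix can only lower its cumulative log-probability, so any path that falls below the threshold can be discarded along with all of its continuations; this is what keeps the exponential search tractable in practice.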
Shuyi Lin, Anshuman Suri, Alina Oprea, Cheng Tan
Computing Technology, Computer Technology
Shuyi Lin, Anshuman Suri, Alina Oprea, Cheng Tan. LLM Jailbreak Oracle [EB/OL]. (2025-06-17) [2025-07-02]. https://arxiv.org/abs/2506.17299.