
BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

Source: arXiv
Abstract

AI agents have the potential to significantly alter the cybersecurity landscape. Here, we introduce the first framework to capture offensive and defensive cyber-capabilities in evolving real-world systems. Instantiating this framework with BountyBench, we set up 25 systems with complex, real-world codebases. To capture the vulnerability lifecycle, we define three task types: Detect (detecting a new vulnerability), Exploit (exploiting a specific vulnerability), and Patch (patching a specific vulnerability). For Detect, we construct a new success indicator, which is general across vulnerability types and provides localized evaluation. We manually set up the environment for each system, including installing packages, setting up server(s), and hydrating database(s). We add 40 bug bounties, which are vulnerabilities with monetary awards of $10-$30,485, covering 9 of the OWASP Top 10 Risks. To modulate task difficulty, we devise a new strategy based on information to guide detection, interpolating from identifying a zero day to exploiting a specific vulnerability. We evaluate 8 agents: Claude Code, OpenAI Codex CLI with o3-high and o4-mini, and custom agents with o3-high, GPT-4.1, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet Thinking, and DeepSeek-R1. Given up to three attempts, the top-performing agents are OpenAI Codex CLI: o3-high (12.5% on Detect, mapping to $3,720; 90% on Patch, mapping to $14,152), Custom Agent with Claude 3.7 Sonnet Thinking (67.5% on Exploit), and OpenAI Codex CLI: o4-mini (90% on Patch, mapping to $14,422). OpenAI Codex CLI: o3-high, OpenAI Codex CLI: o4-mini, and Claude Code are more capable at defense, achieving higher Patch scores of 90%, 90%, and 87.5%, compared to Exploit scores of 47.5%, 32.5%, and 57.5% respectively; while the custom agents are relatively balanced between offense and defense, achieving Exploit scores of 37.5-67.5% and Patch scores of 35-60%.
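As a reading aid, the success rates in the abstract can be turned back into task counts. The sketch below is not from the paper: it assumes each task type is scored over all 40 bounties, and the names `TOTAL_BOUNTIES`, `reported`, and `implied_successes` are hypothetical, introduced only for illustration.

```python
from fractions import Fraction

# 40 bug bounties are stated in the abstract. Using 40 as the denominator
# for every task type is an ASSUMPTION made for this illustration.
TOTAL_BOUNTIES = 40

# Success rates (percent) quoted in the abstract, keyed by (agent, task type).
reported = {
    ("OpenAI Codex CLI: o3-high", "Detect"): "12.5",
    ("OpenAI Codex CLI: o3-high", "Patch"): "90",
    ("OpenAI Codex CLI: o3-high", "Exploit"): "47.5",
    ("OpenAI Codex CLI: o4-mini", "Patch"): "90",
    ("OpenAI Codex CLI: o4-mini", "Exploit"): "32.5",
    ("Claude Code", "Patch"): "87.5",
    ("Claude Code", "Exploit"): "57.5",
    ("Custom Agent: Claude 3.7 Sonnet Thinking", "Exploit"): "67.5",
}


def implied_successes(percent: str, total: int = TOTAL_BOUNTIES) -> Fraction:
    """Convert a quoted percentage into an exact number of solved tasks."""
    return Fraction(percent) / 100 * total


# Exact arithmetic avoids floating-point noise such as 26.999999.
counts = {key: implied_successes(p) for key, p in reported.items()}

for (agent, task), solved in counts.items():
    print(f"{agent:45s} {task:8s} {solved}/{TOTAL_BOUNTIES}")
```

Under that assumption every quoted percentage corresponds to a whole number of tasks, for example 12.5% on Detect to 5 of 40 and 67.5% on Exploit to 27 of 40, which is consistent with a 40-task denominator but does not confirm it.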

Daniel E. Ho, Percy Liang, Andy K. Zhang, Riya Dulepet, Thomas Qin, Ron Y. Wang, Junrong Wu, Kyleen Liao, Jiliang Li, Jinghan Hu, Sara Hong, Nardos Demilew, Shivatmica Murgai, Jason Tran, Nishka Kacheria, Ethan Ho, Denis Liu, Lauren McLane, Olivia Bruvik, Dai-Rong Han, Seungwoo Kim, Akhil Vyas, Cuiyuanxiu Chen, Ryan Li, Weiran Xu, Jonathan Z. Ye, Prerit Choudhary, Siddharth M. Bhatia, Vikram Sivashankar, Yuxuan Bao, Dawn Song, Dan Boneh, Joey Ji, Celeste Menders

Computing technology, computer technology

Daniel E. Ho, Percy Liang, Andy K. Zhang, Riya Dulepet, Thomas Qin, Ron Y. Wang, Junrong Wu, Kyleen Liao, Jiliang Li, Jinghan Hu, Sara Hong, Nardos Demilew, Shivatmica Murgai, Jason Tran, Nishka Kacheria, Ethan Ho, Denis Liu, Lauren McLane, Olivia Bruvik, Dai-Rong Han, Seungwoo Kim, Akhil Vyas, Cuiyuanxiu Chen, Ryan Li, Weiran Xu, Jonathan Z. Ye, Prerit Choudhary, Siddharth M. Bhatia, Vikram Sivashankar, Yuxuan Bao, Dawn Song, Dan Boneh, Joey Ji, Celeste Menders. BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems [EB/OL]. (2025-07-10) [2025-07-25]. https://arxiv.org/abs/2505.15216.
