ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks

Source: arXiv
Abstract

Realizing the vision of using AI agents to automate critical IT tasks depends on the ability to measure and understand the effectiveness of proposed solutions. We introduce ITBench, a framework that offers a systematic methodology for benchmarking AI agents to address real-world IT automation tasks. Our initial release targets three key areas: Site Reliability Engineering (SRE), Compliance and Security Operations (CISO), and Financial Operations (FinOps). The design enables AI researchers to understand the challenges and opportunities of AI agents for IT automation with push-button workflows and interpretable metrics. ITBench includes an initial set of 94 real-world scenarios, which can be easily extended by community contributions. Our results show that agents powered by state-of-the-art models resolve only 13.8% of SRE scenarios, 25.2% of CISO scenarios, and 0% of FinOps scenarios. We expect ITBench to be a key enabler of AI-driven IT automation that is correct, safe, and fast.

Hirokuni Kitahara, Pooja Aggarwal, Takumi Yanagawa, Rong Lee, Noah Zheutlin, Saurabh Jha, Jae-wook Ahn, Yu Deng, Ameet Rahane, Chandrasekhar Narayanaswami, Xinbo Wu, Gerard Vanloo, Naoki Abe, Tianyin Xu, Pavankumar Murali, Daby Sow, Rohan Arora, Pratibha Moogi, Yuji Watanabe, Laura Shwartz, Divya Pathak, Felix George, Yinfang Chen, Carlos Fonseca, Bhavya Bhavya, Ruchir Puri, Bekir O. Turkkan, Jackson Clark, Ruchi Mahindru, Nicholas C. M. Fuller, Harshit Kumar, Anca Sailer, Prateeti Mohapatra, Ting Dai, Michael Nidd, Lav R. Varshney, Suranjana Samanta, Saki Takano, Amit Paradkar, Debanjana Kar, Mudit Verma, Pranjal Gupta, Oishik Chatterjee

Affiliations (in author order): University of Illinois at Urbana-Champaign for Xinbo Wu, Tianyin Xu, Yinfang Chen, Jackson Clark, and Lav R. Varshney; IBM for all other authors.

Subject: computing technology, computer technology

Hirokuni Kitahara, Pooja Aggarwal, Takumi Yanagawa, Rong Lee, Noah Zheutlin, Saurabh Jha, Jae-wook Ahn, Yu Deng, Ameet Rahane, Chandrasekhar Narayanaswami, Xinbo Wu, Gerard Vanloo, Naoki Abe, Tianyin Xu, Pavankumar Murali, Daby Sow, Rohan Arora, Pratibha Moogi, Yuji Watanabe, Laura Shwartz, Divya Pathak, Felix George, Yinfang Chen, Carlos Fonseca, Bhavya Bhavya, Ruchir Puri, Bekir O. Turkkan, Jackson Clark, Ruchi Mahindru, Nicholas C. M. Fuller, Harshit Kumar, Anca Sailer, Prateeti Mohapatra, Ting Dai, Michael Nidd, Lav R. Varshney, Suranjana Samanta, Saki Takano, Amit Paradkar, Debanjana Kar, Mudit Verma, Pranjal Gupta, Oishik Chatterjee. ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks [EB/OL]. (2025-02-07) [2025-05-24]. https://arxiv.org/abs/2502.05352.