AidAI: Automated Incident Diagnosis for AI Workloads in the Cloud
AI workloads experience frequent incidents due to intensive hardware utilization and extended training times. The current incident management workflow is provider-centric: customers report incidents and leave the entire troubleshooting responsibility to the infrastructure provider. However, the inherent knowledge gap between customer and provider significantly reduces incident resolution efficiency. In AI infrastructure, incidents may take several days on average to mitigate, resulting in delays and productivity losses. To address these issues, we present AidAI, a customer-centric system that provides immediate incident diagnosis for customers and streamlines the creation of incident tickets for unresolved issues. The key idea of AidAI is to construct internal knowledge bases from historical on-call experiences during the offline phase, and to mimic the reasoning process of human experts during the online phase by diagnosing incidents through trial and error. Evaluations using real-world incident records from Microsoft show that AidAI achieves an average Micro F1 score of 0.854 and Macro F1 score of 0.816 without significant overhead.
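The two reported metrics weight errors differently, which a small example makes concrete. The sketch below is not taken from the paper: it assumes the standard single-label, multi-class definitions of Micro F1 and Macro F1, and the label names (`gpu`, `net`) and the helper `micro_macro_f1` are hypothetical. The abstract does not state how the scores were computed.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when the class never appears."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions.

    Assumption: each incident has exactly one true root-cause label
    and one predicted label.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    # Micro: pool counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical labels: three GPU incidents and one network incident.
y_true = ["gpu", "gpu", "gpu", "net"]
y_pred = ["gpu", "gpu", "net", "net"]
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

Because Macro F1 averages per-class scores without weighting by frequency, it drops below Micro F1 when rare incident categories are diagnosed less accurately than common ones.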
Yifan Xiong, Baochun Li, Hong Xu, Peng Cheng, Yitao Yang, Yangtao Deng
Computing technology, computer technology
Yifan Xiong, Baochun Li, Hong Xu, Peng Cheng, Yitao Yang, Yangtao Deng. AidAI: Automated Incident Diagnosis for AI Workloads in the Cloud [EB/OL]. (2025-06-02) [2025-06-17]. https://arxiv.org/abs/2506.01481.