Odyssey: Adaptive Policy Selection for Resilient Distributed Training

Abstract

Training large language models is frequently interrupted by faults, which demands robust fault tolerance. Existing backup-free methods, such as redundant computation, dynamic parallelism, and data rerouting, each incur performance penalties, whether from ongoing overhead, lengthy reconfigurations, or post-recovery inefficiencies. We propose Odyssey, an adaptive fault-tolerant system that selects the best recovery strategy when a failure occurs. Odyssey achieves this through a unified performance model, fast execution plan search, accurate performance estimation, and efficient communication optimizations. Experiments on a 32-card cluster show that Odyssey keeps the performance gap between post-recovery and failure-free training within 11.00%, while preserving model convergence and efficient memory usage. Compared to state-of-the-art methods, Odyssey achieves up to 1.229x and 1.355x higher average throughput than Oobleck and Recycle, respectively.
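The trade-off described in the abstract (a cheap switch with lower throughput afterwards versus a slow reconfiguration with higher throughput afterwards) can be illustrated with a toy selection rule. This is only a sketch under stated assumptions, not Odyssey's performance model: the names `PolicyEstimate`, `expected_samples`, and `select_policy`, the fixed time horizon, and every number below are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyEstimate:
    """Hypothetical per-policy estimate; not taken from the paper."""
    name: str
    reconfig_seconds: float  # assumed one-time cost of switching to this policy
    throughput: float        # assumed post-recovery throughput, samples/second


def expected_samples(policy: PolicyEstimate, horizon_seconds: float) -> float:
    """Samples processed within the horizon after paying the switch cost."""
    usable = max(0.0, horizon_seconds - policy.reconfig_seconds)
    return usable * policy.throughput


def select_policy(candidates, horizon_seconds: float) -> PolicyEstimate:
    """Pick the candidate that maximizes expected samples over the horizon."""
    if not candidates:
        raise ValueError("at least one candidate policy is required")
    return max(candidates, key=lambda p: expected_samples(p, horizon_seconds))


# Illustrative numbers only: rerouting is quick to apply but slower afterwards,
# while re-planning parallelism is slow to apply but faster afterwards.
candidates = [
    PolicyEstimate("data_rerouting", reconfig_seconds=5.0, throughput=80.0),
    PolicyEstimate("dynamic_parallelism", reconfig_seconds=120.0, throughput=95.0),
]

short_horizon_choice = select_policy(candidates, horizon_seconds=600.0)
long_horizon_choice = select_policy(candidates, horizon_seconds=3600.0)
print(short_horizon_choice.name, long_horizon_choice.name)
```

With these made-up inputs, the quick switch wins over a short horizon and the slower reconfiguration wins over a long one, which is the kind of situation-dependent choice that motivates adaptive selection rather than a single fixed recovery policy.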

Authors: Yuhang Zhou, Zhibin Wang, Peng Jiang, Haoran Xia, Junhe Lu, Qianyu Jiang, Rong Gu, Hengxi Xu, Xinjing Huang, Guanghuan Fang, Zhiheng Hu, Jingyi Zhang, Yongjin Cai, Jian He, Chen Tian

Subject classification: Computing technology, computer technology

Recommended citation: Yuhang Zhou, Zhibin Wang, Peng Jiang, Haoran Xia, Junhe Lu, Qianyu Jiang, Rong Gu, Hengxi Xu, Xinjing Huang, Guanghuan Fang, Zhiheng Hu, Jingyi Zhang, Yongjin Cai, Jian He, Chen Tian. Odyssey: Adaptive Policy Selection for Resilient Distributed Training [EB/OL]. (2025-08-29) [2025-09-10]. https://arxiv.org/abs/2508.21613.
