Odyssey: Adaptive Policy Selection for Resilient Distributed Training

Source: arXiv
Abstract

Training large language models is frequently interrupted by faults, which demands robust fault tolerance. Existing backup-free methods, such as redundant computation, dynamic parallelism, and data rerouting, each incur a performance penalty, whether from ongoing overhead, lengthy reconfiguration, or post-recovery inefficiency. We propose Odyssey, an adaptive fault-tolerant system that selects the best recovery strategy when a failure occurs. Odyssey achieves this through a unified performance model, fast execution plan search, accurate performance estimation, and efficient communication optimizations. Experiments on a 32-card cluster show that Odyssey keeps the performance gap between post-recovery and failure-free training within 11.00%, while preserving model convergence and efficient memory usage. Compared to state-of-the-art methods, Odyssey achieves up to 1.229x and 1.355x higher average throughput than Oobleck and Recycle, respectively.

Yuhang Zhou, Zhibin Wang, Peng Jiang, Haoran Xia, Junhe Lu, Qianyu Jiang, Rong Gu, Hengxi Xu, Xinjing Huang, Guanghuan Fang, Zhiheng Hu, Jingyi Zhang, Yongjin Cai, Jian He, Chen Tian

Subject: Computing technology, computer technology

Yuhang Zhou, Zhibin Wang, Peng Jiang, Haoran Xia, Junhe Lu, Qianyu Jiang, Rong Gu, Hengxi Xu, Xinjing Huang, Guanghuan Fang, Zhiheng Hu, Jingyi Zhang, Yongjin Cai, Jian He, Chen Tian. Odyssey: Adaptive Policy Selection for Resilient Distributed Training [EB/OL]. (2025-08-29) [2025-09-10]. https://arxiv.org/abs/2508.21613.
