DeepCEE: Efficient Cross-Region Model Distributed Training System under Heterogeneous GPUs and Networks

Source: arXiv
Abstract

Most existing training systems focus on a single region. In contrast, we envision that cross-region training offers more flexible GPU resource allocation and holds significant potential. However, the hierarchical cluster topology and unstable networks in the cloud-edge-end (CEE) environment, a typical cross-region scenario, pose substantial challenges to building an efficient and autonomous model training system. We propose DeepCEE, a geo-distributed model training system tailored for heterogeneous GPUs and networks in CEE environments. DeepCEE adopts a communication-centric design philosophy to tackle the challenges arising from slow and unstable inter-region networks. It begins with a heterogeneous device profiler that identifies and groups devices based on both network and compute characteristics. Leveraging these device groups, DeepCEE implements compact, zero-bubble pipeline parallelism and automatically derives optimal parallel strategies. To further adapt to runtime variability, DeepCEE integrates a dynamic environment adapter that reacts to network fluctuations. Extensive evaluations demonstrate that DeepCEE achieves 1.3-2.8x higher training throughput compared to widely used and state-of-the-art (SOTA) training systems.
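The abstract does not detail how the device profiler groups devices. The following is a minimal Python sketch of one plausible approach, assuming devices are greedily clustered so that every intra-group link meets a bandwidth floor, with faster devices placed first; all names here (Device, DeviceGroup, group_devices, min_intra_bw_gbps) are hypothetical and not DeepCEE's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Device:
    name: str
    tflops: float  # measured compute throughput
    region: str

@dataclass
class DeviceGroup:
    devices: list[Device] = field(default_factory=list)

def group_devices(devices, bandwidth, min_intra_bw_gbps=10.0):
    """Greedily place devices into groups so that every pair inside a
    group is connected at >= min_intra_bw_gbps; slow inter-region links
    then naturally fall on group boundaries."""
    groups: list[DeviceGroup] = []
    for dev in sorted(devices, key=lambda d: -d.tflops):
        for g in groups:
            if all(bandwidth[frozenset({dev.name, o.name})] >= min_intra_bw_gbps
                   for o in g.devices):
                g.devices.append(dev)
                break
        else:
            groups.append(DeviceGroup(devices=[dev]))
    return groups

# Two cloud GPUs plus one edge GPU; the 0.4 Gbps WAN link keeps the
# edge device in its own group.
bw = {
    frozenset({"a100-0", "a100-1"}): 100.0,
    frozenset({"a100-0", "v100-0"}): 0.4,
    frozenset({"a100-1", "v100-0"}): 0.4,
}
devs = [Device("a100-0", 312.0, "cloud"),
        Device("a100-1", 312.0, "cloud"),
        Device("v100-0", 125.0, "edge")]
print([[d.name for d in g.devices] for g in group_devices(devs, bw)])
# -> [['a100-0', 'a100-1'], ['v100-0']]
```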
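Likewise, the abstract says only that the dynamic environment adapter "reacts to network fluctuations." The sketch below assumes one concrete policy purely for illustration: widening gradient accumulation when a probed inter-region link degrades, so synchronization happens less often over the slow link. The probe callable and thresholds are assumptions, not the paper's method.

```python
class EnvironmentAdapter:
    """Hypothetical adapter: probe() returns the current inter-region
    bandwidth in Gbps; when the link degrades, the number of
    gradient-accumulation steps doubles (fewer, larger synchronizations),
    and it shrinks back once the link recovers."""

    def __init__(self, probe, base_accum_steps=4, bw_floor_gbps=1.0):
        self.probe = probe
        self.base_accum_steps = base_accum_steps
        self.bw_floor_gbps = bw_floor_gbps
        self.accum_steps = base_accum_steps

    def step(self) -> int:
        bw = self.probe()
        if bw < self.bw_floor_gbps:
            # Degraded link: synchronize less often, capped at 8x baseline.
            self.accum_steps = min(self.accum_steps * 2,
                                   self.base_accum_steps * 8)
        else:
            # Healthy link: return toward the baseline schedule.
            self.accum_steps = max(self.accum_steps // 2,
                                   self.base_accum_steps)
        return self.accum_steps

# Simulated bandwidth trace: healthy, degraded, degraded, recovered.
trace = iter([5.0, 0.3, 0.2, 4.0])
adapter = EnvironmentAdapter(probe=lambda: next(trace))
print([adapter.step() for _ in range(4)])  # -> [4, 8, 16, 8]
```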

Jinquan Wang, Xiaojian Liao, Xuzhao Liu, Jiashun Suo, Zhisheng Huo, Chenhao Zhang, Xiangrong Xu, Runnan Shen, Xilong Xie, Limin Xiao

Subjects: Computing Technology; Computer Technology

Jinquan Wang, Xiaojian Liao, Xuzhao Liu, Jiashun Suo, Zhisheng Huo, Chenhao Zhang, Xiangrong Xu, Runnan Shen, Xilong Xie, Limin Xiao. DeepCEE: Efficient Cross-Region Model Distributed Training System under Heterogeneous GPUs and Networks [EB/OL]. (2025-05-21) [2025-06-05]. https://arxiv.org/abs/2505.15536.