National Preprint Platform

Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters

Source: arXiv

Abstract

The high GPU demand of ML training makes it hard to allocate large homogeneous clusters of high-end GPUs in a single availability zone. Leveraging heterogeneous GPUs available within and across zones can improve throughput at a reasonable cost. However, training ML models on heterogeneous resources introduces significant challenges, such as stragglers and a large search space of possible job configurations. Current systems lack support for efficiently training models on heterogeneous resources. We present Sailor, a system that automates distributed training over heterogeneous, geo-distributed, and dynamically available resources. Sailor combines an efficient search space exploration algorithm, accurate runtime and memory footprint simulation, and a distributed training framework that supports different types of heterogeneity to optimize training throughput and cost.
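To illustrate the kind of configuration search the abstract describes, below is a minimal, hypothetical sketch: it enumerates (data, tensor, pipeline) parallelism degrees over toy heterogeneous GPU profiles, estimates throughput with a crude analytical model, and picks the configuration with the best throughput per dollar. All GPU specs, function names, and the cost model are illustrative assumptions, not Sailor's actual algorithm or runtime simulator.

```python
from itertools import product

# Hypothetical GPU profiles (illustrative numbers, not from the paper)
GPU_TYPES = {
    "A100": {"tflops": 312, "cost_per_hr": 4.0},
    "V100": {"tflops": 125, "cost_per_hr": 2.5},
}

def simulate_throughput(gpu, dp, tp, pp, model_tflops_per_sample=10.0):
    """Toy runtime model: compute-bound throughput across all GPUs,
    discounted by a crude pipeline-bubble penalty."""
    n_gpus = dp * tp * pp
    compute = GPU_TYPES[gpu]["tflops"] * n_gpus
    bubble = 1.0 - (pp - 1) / (pp + 8)  # deeper pipelines idle more
    return compute * bubble / model_tflops_per_sample  # samples/sec

def search_best_config(budget_gpus=16):
    """Exhaustively score every feasible (gpu type, dp, tp, pp) combo
    and return the one maximizing throughput per dollar."""
    best = None
    for gpu, dp, tp, pp in product(GPU_TYPES, [1, 2, 4], [1, 2, 4], [1, 2, 4]):
        n = dp * tp * pp
        if n > budget_gpus:
            continue
        thr = simulate_throughput(gpu, dp, tp, pp)
        cost = n * GPU_TYPES[gpu]["cost_per_hr"]
        score = thr / cost  # throughput per dollar per hour
        if best is None or score > best[0]:
            best = (score, gpu, dp, tp, pp)
    return best

print(search_best_config())
```

A real planner must also model inter-zone network bandwidth and per-stage memory footprint, which is why the paper pairs the search with accurate runtime and memory simulation; this sketch only conveys the shape of the search space.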

Ana Klimovic, Foteini Strati, Zhendong Zhang, George Manos, Ixeia Sánchez Périz, Qinghao Hu, Tiancheng Chen, Berk Buzcu, Song Han, Pamela Delgado

Subjects: Computing technology, computer technology

Ana Klimovic, Foteini Strati, Zhendong Zhang, George Manos, Ixeia Sánchez Périz, Qinghao Hu, Tiancheng Chen, Berk Buzcu, Song Han, Pamela Delgado. Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters [EB/OL]. (2025-04-23) [2025-05-13]. https://arxiv.org/abs/2504.17096.