
ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates


Abstract

Fine-tuning large language models (LLMs) often exceeds GPU memory limits, prompting systems to offload model states to CPU memory. However, existing offloaded training frameworks like ZeRO-Offload treat all parameters equally and update the full model on the CPU, causing severe GPU stalls, where fast, expensive GPUs sit idle waiting for slow CPU updates and limited-bandwidth PCIe transfers. We present ZenFlow, a new offloading framework that prioritizes important parameters and decouples updates between GPU and CPU. ZenFlow performs in-place updates of important gradients on GPU, while asynchronously offloading and accumulating less important ones on CPU, fully overlapping CPU work with GPU computation. To scale across GPUs, ZenFlow introduces a lightweight gradient selection method that exploits a novel spatial and temporal locality property of important gradients, avoiding costly global synchronization. ZenFlow achieves up to 5x end-to-end speedup, 2x lower PCIe traffic, and reduces GPU stalls by over 85 percent, all while preserving accuracy.
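The split update described in the abstract can be illustrated with a toy, single-process sketch. Everything here is an assumption made for illustration: the abstract does not say how importance is measured, so the sketch ranks gradients by absolute value; the names `split_by_importance`, `ToyOffloadSGD`, `k`, and `flush_every` are hypothetical; and the "CPU" path is an ordinary Python list updated sequentially, so no real asynchrony or GPU/CPU overlap takes place. It is not ZenFlow's implementation.

```python
from typing import List, Sequence, Tuple


def split_by_importance(grads: Sequence[float], k: int) -> Tuple[List[int], List[int]]:
    """Split indices into the k largest-magnitude gradients and the rest.

    Ranking by absolute value is an assumption; the abstract does not
    specify the importance criterion.
    """
    order = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)
    return sorted(order[:k]), sorted(order[k:])


class ToyOffloadSGD:
    """Plain SGD with immediate updates for 'important' gradients and
    deferred, accumulated updates for the remainder (hypothetical helper)."""

    def __init__(self, params: Sequence[float], lr: float = 0.1,
                 k: int = 1, flush_every: int = 2) -> None:
        self.params = list(params)
        self.lr = lr
        self.k = k
        self.flush_every = flush_every
        # Stands in for the CPU-side accumulation buffer.
        self.cpu_accum = [0.0] * len(self.params)
        self.steps = 0

    def step(self, grads: Sequence[float]) -> None:
        important, rest = split_by_importance(grads, self.k)
        for i in important:
            # Important gradients: update the parameter right away.
            self.params[i] -= self.lr * grads[i]
        for i in rest:
            # Less important gradients: accumulate for a later update.
            self.cpu_accum[i] += grads[i]
        self.steps += 1
        if self.steps % self.flush_every == 0:
            self.flush()

    def flush(self) -> None:
        """Apply all accumulated gradients, then clear the buffer."""
        for i, g in enumerate(self.cpu_accum):
            self.params[i] -= self.lr * g
            self.cpu_accum[i] = 0.0


opt = ToyOffloadSGD([1.0, 1.0, 1.0], lr=0.5, k=1, flush_every=2)
opt.step([4.0, 1.0, -2.0])   # only index 0 is updated immediately
opt.step([1.0, 2.0, -4.0])   # index 2 is updated, then the buffer is flushed
print(opt.params)
```

In this toy the gradients do not depend on the parameters, so after a flush the result matches ordinary SGD applied to the summed gradients; in real training the deferred updates are computed against stale parameters, which is where the choice of which gradients to defer matters.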

Authors: Tingfeng Lan, Yusen Wu, Bin Ma, Zhaoyuan Su, Rui Yang, Tekin Bicer, Dong Li, Yue Cheng


Subject classification: Computing technology, computer technology

Recommended citation: Tingfeng Lan, Yusen Wu, Bin Ma, Zhaoyuan Su, Rui Yang, Tekin Bicer, Dong Li, Yue Cheng. ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates [EB/OL]. (2025-05-18) [2025-06-06]. https://arxiv.org/abs/2505.12242.
