
TensorSocket: Shared Data Loading for Deep Learning Training

Source: arXiv
Abstract

Training deep learning models is a repetitive and resource-intensive process. Data scientists often train several models before landing on a set of parameters (e.g., hyper-parameter tuning) and model architecture (e.g., neural architecture search), among other things that yield the highest accuracy. The computational efficiency of these training tasks depends highly on how well the training data is supplied to the training process. The repetitive nature of these tasks results in the same data processing pipelines running over and over, exacerbating the need for and costs of computational resources. In this paper, we present TensorSocket to reduce the computational needs of deep learning training by enabling simultaneous training processes to share the same data loader. TensorSocket mitigates CPU-side bottlenecks in cases where the collocated training workloads have high throughput on GPU, but are held back by lower data-loading throughput on CPU. TensorSocket achieves this by reducing redundant computations and data duplication across collocated training processes and leveraging modern GPU-GPU interconnects. While doing so, TensorSocket is able to train and balance differently-sized models and serve multiple batch sizes simultaneously and is hardware- and pipeline-agnostic in nature. Our evaluation shows that TensorSocket enables scenarios that are infeasible without data sharing, increases training throughput by up to 100%, and when utilizing cloud instances, achieves cost savings of 50% by reducing the hardware resource needs on the CPU side. Furthermore, TensorSocket outperforms the state-of-the-art solutions for shared data loading such as CoorDL and Joader; it is easier to deploy and maintain and either achieves higher or matches their throughput while requiring fewer CPU resources.
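The following is a minimal conceptual sketch of the shared data loading idea described in the abstract, not the actual TensorSocket implementation or API: a single producer process loads and preprocesses each batch once and fans it out to several collocated training processes, so the CPU-side pipeline is not duplicated per job. All names (producer, trainer, the queue-based transport) are hypothetical illustrations; TensorSocket itself additionally exploits GPU-GPU interconnects and supports differently-sized models and batch sizes.

```python
# Illustrative sketch only (not the TensorSocket API): one shared loader process
# feeds multiple collocated trainers, avoiding redundant data loading on the CPU.
import multiprocessing as mp


def producer(queues, num_batches):
    """Load/augment each batch once and send a copy to every trainer."""
    for step in range(num_batches):
        batch = [x * 0.5 for x in range(step, step + 4)]  # stand-in for real data loading
        for q in queues:
            q.put((step, batch))
    for q in queues:
        q.put(None)  # end-of-epoch sentinel


def trainer(name, queue):
    """Consume shared batches; each trainer could run a differently-sized model."""
    while True:
        item = queue.get()
        if item is None:
            break
        step, batch = item
        loss = sum(batch) / len(batch)  # placeholder for a forward/backward pass
        print(f"{name}: step {step}, loss {loss:.3f}")


if __name__ == "__main__":
    queues = [mp.Queue(maxsize=8) for _ in range(2)]  # one queue per collocated trainer
    trainers = [mp.Process(target=trainer, args=(f"trainer-{i}", q))
                for i, q in enumerate(queues)]
    loader = mp.Process(target=producer, args=(queues, 5))
    for p in trainers + [loader]:
        p.start()
    for p in trainers + [loader]:
        p.join()
```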

Authors: Ties Robroek, Neil Kim Nielsen, Pınar Tözün

DOI: 10.1145/3749185

Subject: Computing technology; computer technology

Ties Robroek, Neil Kim Nielsen, Pınar Tözün. TensorSocket: Shared Data Loading for Deep Learning Training [EB/OL]. (2025-08-01) [2025-08-16]. https://arxiv.org/abs/2409.18749
