scDataset: Scalable Data Loading for Deep Learning on Large-Scale Single-Cell Omics

Source: arXiv
Abstract

Modern single-cell datasets now comprise hundreds of millions of cells, presenting significant challenges for training deep learning models that require shuffled, memory-efficient data loading. While the AnnData format is the community standard for storing single-cell datasets, existing data loading solutions for AnnData are often inadequate: some require loading all data into memory, others convert to dense formats that increase storage demands, and many are hampered by slow random disk access. We present scDataset, a PyTorch IterableDataset that operates directly on one or more AnnData files without the need for format conversion. The core innovation is a combination of block sampling and batched fetching, which together balance randomness and I/O efficiency. On the Tahoe 100M dataset, scDataset achieves up to a 48× speed-up over AnnLoader, a 27× speed-up over HuggingFace Datasets, and an 18× speed-up over BioNeMo in single-core settings. These advances democratize large-scale single-cell model training for the broader research community.
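
The abstract names block sampling and batched fetching but does not spell out their mechanics. One plausible reading can be sketched in plain Python: shuffle contiguous blocks of row indices rather than individual rows, then fetch several minibatches' worth of rows in one sorted read and reshuffle them in memory. The function name and the parameters `block_size` and `fetch_factor` are assumptions made for illustration; this is not the scDataset API.

```python
import random


def block_shuffled_batches(n_items, batch_size, block_size, fetch_factor, seed=0):
    """Yield minibatches of row indices (illustrative sketch only).

    block_size and fetch_factor are hypothetical parameter names.
    """
    rng = random.Random(seed)

    # Block sampling: cut the index range into contiguous blocks and
    # shuffle the order of the blocks, not the individual indices.
    blocks = [
        list(range(start, min(start + block_size, n_items)))
        for start in range(0, n_items, block_size)
    ]
    rng.shuffle(blocks)
    order = [i for block in blocks for i in block]

    # Batched fetching: gather fetch_factor minibatches' worth of indices,
    # sort them so the read touches a few contiguous runs on disk, then
    # shuffle in memory before slicing into minibatches.
    fetch_size = batch_size * fetch_factor
    for start in range(0, len(order), fetch_size):
        chunk = sorted(order[start:start + fetch_size])
        # A real loader would read the rows for `chunk` from disk here.
        rng.shuffle(chunk)
        for b in range(0, len(chunk), batch_size):
            yield chunk[b:b + batch_size]


batches = list(block_shuffled_batches(100, batch_size=8, block_size=4, fetch_factor=4))
```

In this sketch, a larger `block_size` means fewer random seeks but less mixing across the dataset, while a larger `fetch_factor` restores mixing within each fetch at the cost of memory. That is one way to interpret the trade-off between randomness and I/O efficiency described above.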

Davide D'Ascenzo, Sebastiano Cultrera di Montesano

Subjects: Biological science research methods and techniques; Computing and computer technology

Davide D'Ascenzo, Sebastiano Cultrera di Montesano. scDataset: Scalable Data Loading for Deep Learning on Large-Scale Single-Cell Omics [EB/OL]. (2025-06-02) [2025-07-16]. https://arxiv.org/abs/2506.01883.