
CellPLM: Pre-training of Cell Language Model Beyond Single Cells


Source: bioRxiv
Abstract

The current state-of-the-art single-cell pre-trained models are greatly inspired by the success of large language models. These models train transformers by treating genes as tokens and cells as sentences. However, three fundamental differences between single-cell data and natural language data are overlooked: (1) scRNA-seq data are presented as a bag of genes rather than as sequences of RNAs; (2) cell-cell relations are more intricate and important than inter-sentence relations; and (3) the quantity of single-cell data is considerably smaller than that of text data, and the data are very noisy. In light of these characteristics, we propose a new pre-trained model, CellPLM, which takes cells as tokens and tissues as sentences. In addition, we leverage spatially resolved transcriptomic data in pre-training to facilitate learning cell-cell relationships, and we introduce a Gaussian mixture prior distribution as an additional inductive bias to overcome data limitations. CellPLM is the first single-cell pre-trained transformer that encodes cell-cell relations, and it consistently outperforms existing pre-trained and non-pre-trained models on diverse downstream tasks, with roughly 100x higher inference speed than existing pre-trained models.
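To make the "cells as tokens, tissues as sentences" idea concrete, the sketch below shows one way such an encoder could be wired up. This is a minimal illustration, not the authors' released CellPLM code: the class name `CellTokenTransformer` and all dimensions are hypothetical, and the actual model additionally uses masked-expression pre-training and the Gaussian mixture prior described in the abstract, which are omitted here.

```python
# Minimal sketch (assumed, not the authors' implementation): each cell's
# bag-of-genes expression vector becomes one token, and self-attention
# runs across cells within a tissue to encode cell-cell relations.
import torch
import torch.nn as nn

class CellTokenTransformer(nn.Module):
    def __init__(self, n_genes: int, d_model: int = 128,
                 n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        # Embed each cell's expression profile into a single token.
        self.cell_embed = nn.Linear(n_genes, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        # Attention is over cells in a tissue, not over genes in a cell.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

    def forward(self, expr: torch.Tensor) -> torch.Tensor:
        # expr: (tissues, cells per tissue, genes)
        tokens = self.cell_embed(expr)   # one token per cell
        return self.encoder(tokens)      # contextualized cell embeddings

# Toy usage: 2 tissues, 50 cells each, 1,000 genes.
model = CellTokenTransformer(n_genes=1000)
cell_states = model(torch.randn(2, 50, 1000))
print(cell_states.shape)  # torch.Size([2, 50, 128])
```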

Authors: Tang Jiliang, Tang Wenzhuo, Jin Wei, Xie Yuying, Wen Hongzhi, Ding Jiayuan, Dai Xinnan

DOI: 10.1101/2023.10.03.560734

Subjects: Cell Biology; Computing Technology and Computer Technology; Molecular Biology

Tang Jiliang, Tang Wenzhuo, Jin Wei, Xie Yuying, Wen Hongzhi, Ding Jiayuan, Dai Xinnan. CellPLM: Pre-training of Cell Language Model Beyond Single Cells [EB/OL]. (2025-03-28) [2025-05-15]. https://www.biorxiv.org/content/10.1101/2023.10.03.560734
