scBERT is a Large-scale Pretrained Deep Language Model for Cell Type Annotation of Single-cell RNA-seq Data

Source: bioRxiv
English abstract

Annotating cell types from single-cell RNA-seq (scRNA-seq) data is a prerequisite for research on disease progression and the tumor microenvironment. Here we show that existing annotation methods typically suffer from a lack of curated marker gene lists, improper handling of batch effects, and difficulty in leveraging latent gene-gene interaction information, which impairs their generalization and robustness. Inspired by BERT’s revolutionary recipe, we developed scBERT (single-cell Bidirectional Encoder Representations from Transformers), a pre-trained deep neural network-based model, to overcome these challenges. Following BERT’s pre-train and fine-tune paradigm, scBERT obtains a general understanding of gene-gene interactions by being pre-trained on large amounts of unlabeled scRNA-seq data, and is then transferred to the cell type annotation task on unseen, user-specific scRNA-seq data through supervised fine-tuning. Extensive and rigorous benchmark studies validated the superior performance of scBERT on cell type annotation, novel cell type discovery, robustness to batch effects, and model interpretability.

Yao Jianhua, Yang Fan, Huang Junzhou, Wang Fang, Tang Duyu, Wang Wenchuan, Lu Hui, Fang Yuan

AI Lab
SJTU-Yale Joint Center for Biostatistics and Data Science, Department of Bioinformatics and Biostatistics, School of Life Science and Biotechnology, Shanghai Jiao Tong University
Center for Biomedical Informatics, Shanghai Engineering Research Center for Big Data in Pediatric Precision Medicine, Shanghai Children's Hospital
Department of Molecular and Cellular Biology, Harvard University

DOI: 10.1101/2021.12.05.471261

Subjects: Biological science research methods and techniques; Cell biology; Molecular biology

Yao Jianhua, Yang Fan, Huang Junzhou, Wang Fang, Tang Duyu, Wang Wenchuan, Lu Hui, Fang Yuan. scBERT is a Large-scale Pretrained Deep Language Model for Cell Type Annotation of Single-cell RNA-seq Data [EB/OL]. (2025-03-28) [2025-05-28]. https://www.biorxiv.org/content/10.1101/2021.12.05.471261.
