CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells

Abstract: Single-cell sequencing provides transcriptomic profiling at single-cell resolution, uncovering cellular heterogeneity with unprecedented precision. Yet current single-cell data analysis suffers from inherent data noise, batch effects, and sparsity, highlighting the need for a unified model to represent cellular states. To address this problem, many recent efforts have focused on training single-cell foundation models on large datasets. However, current human foundation models are still limited by the sizes of their training data and model parameters. Here, we collect a diverse dataset of 100 million human cells, on which we train a single-cell foundation model (CellFM) containing 800 million parameters. To balance efficiency and performance, the model is trained with a modified RetNet architecture on the MindSpore framework. Extensive experiments show that CellFM outperforms existing models in cell annotation, perturbation prediction, gene function prediction, and capturing gene-gene relationships.
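
The only methodological detail the abstract gives is that CellFM trades standard softmax attention for a modified RetNet-style retention mechanism to balance efficiency and performance. As a rough, hypothetical illustration of what a single retention head computes (parallel form), the sketch below is written in PyTorch rather than the MindSpore stack the paper uses; the class name, decay value, and dimensions are made up for the example, and it omits the multi-scale heads, gating, normalization, and the authors' specific modifications.

```python
import torch
import torch.nn as nn


class SimpleRetention(nn.Module):
    """Single-head retention, parallel form, heavily simplified for illustration."""

    def __init__(self, dim: int, decay: float = 0.9):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.decay = decay  # gamma: how quickly the influence of earlier tokens fades

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), e.g. one embedded token per gene in a cell's profile
        b, n, d = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # D[i, j] = gamma^(i - j) for i >= j, else 0: a decaying causal mask that
        # replaces the softmax of standard attention.
        idx = torch.arange(n, device=x.device)
        exponent = (idx[:, None] - idx[None, :]).clamp(min=0).float()
        causal = (idx[:, None] >= idx[None, :]).float()
        D = (self.decay ** exponent) * causal
        scores = (q @ k.transpose(-1, -2)) / d**0.5 * D
        return scores @ v


if __name__ == "__main__":
    layer = SimpleRetention(dim=64)
    cells = torch.randn(2, 128, 64)   # 2 cells x 128 gene tokens x 64-dim embeddings
    print(layer(cells).shape)         # -> torch.Size([2, 128, 64])
```

Because the decayed score matrix involves no softmax, an equivalent recurrent form of the same layer exists, which is what gives RetNet-style models their efficiency advantage over standard Transformers on long token sequences.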


Bibliographic Details
Main Authors: Yuansong Zeng, Jiancong Xie, Ningyuan Shangguan, Zhuoyi Wei, Wenbing Li, Yun Su, Shuangyu Yang, Chengyang Zhang, Jinbo Zhang, Nan Fang, Hongyu Zhang, Yutong Lu, Huiying Zhao, Jue Fan, Weijiang Yu, Yuedong Yang
Format: Article
Language: English
Published: Nature Portfolio, 2025-05-01
Series: Nature Communications, Vol. 16, Iss. 1, pp. 1-17
ISSN: 2041-1723
Online Access:https://doi.org/10.1038/s41467-025-59926-5
Author Affiliations:
Yuansong Zeng, Jiancong Xie, Ningyuan Shangguan, Zhuoyi Wei, Wenbing Li, Yutong Lu, Weijiang Yu, Yuedong Yang: School of Computer Science and Engineering, Sun Yat-sen University
Yun Su: Huawei Technologies Co., Ltd
Shuangyu Yang, Huiying Zhao: Department of Medical Research Center, Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University
Chengyang Zhang, Hongyu Zhang: School of Big Data and Software Engineering, Chongqing University
Jinbo Zhang, Nan Fang, Jue Fan: Singleron Biotechnologies, Nanjing