CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells
Abstract: Single-cell sequencing provides transcriptomic profiling at single-cell resolution, uncovering cellular heterogeneity with unprecedented precision. Yet current single-cell data analysis suffers from inherent noise, batch effects, and sparsity, highlighting the need for a un...
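| Main Authors: | Yuansong Zeng, Jiancong Xie, Ningyuan Shangguan, Zhuoyi Wei, Wenbing Li, Yun Su, Shuangyu Yang, Chengyang Zhang, Jinbo Zhang, Nan Fang, Hongyu Zhang, Yutong Lu, Huiying Zhao, Jue Fan, Weijiang Yu, Yuedong Yang |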
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-05-01 |
| Series: | Nature Communications |
| Online Access: | https://doi.org/10.1038/s41467-025-59926-5 |
| _version_ | 1850125804907790336 |
|---|---|
| author | Yuansong Zeng Jiancong Xie Ningyuan Shangguan Zhuoyi Wei Wenbing Li Yun Su Shuangyu Yang Chengyang Zhang Jinbo Zhang Nan Fang Hongyu Zhang Yutong Lu Huiying Zhao Jue Fan Weijiang Yu Yuedong Yang |
| author_sort | Yuansong Zeng |
| collection | DOAJ |
| description | Abstract: Single-cell sequencing provides transcriptomic profiling at single-cell resolution, uncovering cellular heterogeneity with unprecedented precision. Yet current single-cell data analysis suffers from inherent noise, batch effects, and sparsity, highlighting the need for a unified model to represent cellular states. To address this problem, many recent efforts focus on training single-cell foundation models on large datasets. However, current human foundation models are still limited by the size of their training data and their parameter counts. Here, we collect a diverse dataset of 100 million human cells, on which we train a single-cell foundation model (CellFM) containing 800 million parameters. To balance efficiency and performance, the model is trained with a modified RetNet framework on MindSpore. Extensive experiments show that CellFM outperforms existing models in cell annotation, perturbation prediction, gene function prediction, and capturing gene-gene relationships. |
| format | Article |
| id | doaj-art-0a6286883ea947e0a6d493c947bb4a9d |
| institution | OA Journals |
| issn | 2041-1723 |
| language | English |
| publishDate | 2025-05-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Nature Communications |
| spelling | doaj-art-0a6286883ea947e0a6d493c947bb4a9d; 2025-08-20T02:34:04Z; eng; Nature Portfolio; Nature Communications; 2041-1723; 2025-05-01; volume 16, issue 1, pages 1-17; 10.1038/s41467-025-59926-5; CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells; Yuansong Zeng (School of Computer Science and Engineering, Sun Yat-sen University); Jiancong Xie (School of Computer Science and Engineering, Sun Yat-sen University); Ningyuan Shangguan (School of Computer Science and Engineering, Sun Yat-sen University); Zhuoyi Wei (School of Computer Science and Engineering, Sun Yat-sen University); Wenbing Li (School of Computer Science and Engineering, Sun Yat-sen University); Yun Su (Huawei Technologies Co., Ltd); Shuangyu Yang (Department of Medical Research Center, Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University); Chengyang Zhang (School of Big Data and Software Engineering, Chongqing University); Jinbo Zhang (Singleron Biotechnologies, Nanjing); Nan Fang (Singleron Biotechnologies, Nanjing); Hongyu Zhang (School of Big Data and Software Engineering, Chongqing University); Yutong Lu (School of Computer Science and Engineering, Sun Yat-sen University); Huiying Zhao (Department of Medical Research Center, Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University); Jue Fan (Singleron Biotechnologies, Nanjing); Weijiang Yu (School of Computer Science and Engineering, Sun Yat-sen University); Yuedong Yang (School of Computer Science and Engineering, Sun Yat-sen University); Abstract: Single-cell sequencing provides transcriptomic profiling at single-cell resolution, uncovering cellular heterogeneity with unprecedented precision. Yet current single-cell data analysis suffers from inherent noise, batch effects, and sparsity, highlighting the need for a unified model to represent cellular states. To address this problem, many recent efforts focus on training single-cell foundation models on large datasets. However, current human foundation models are still limited by the size of their training data and their parameter counts. Here, we collect a diverse dataset of 100 million human cells, on which we train a single-cell foundation model (CellFM) containing 800 million parameters. To balance efficiency and performance, the model is trained with a modified RetNet framework on MindSpore. Extensive experiments show that CellFM outperforms existing models in cell annotation, perturbation prediction, gene function prediction, and capturing gene-gene relationships.; https://doi.org/10.1038/s41467-025-59926-5 |
| title | CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells |
| url | https://doi.org/10.1038/s41467-025-59926-5 |