annATAC: automatic cell type annotation for scATAC-seq data based on language model

Bibliographic Details
Main Authors: Lingyu Cui, Fang Wang, Hongfei Li, Qiaoming Liu, Murong Zhou, Guohua Wang
Format: Article
Language: English
Published: BMC 2025-05-01
Series: BMC Biology
Online Access: https://doi.org/10.1186/s12915-025-02244-5
Description
Summary: Background: Cell type annotation is the cornerstone of downstream analysis of single-cell data. However, scATAC-seq data are highly sparse and high-dimensional, which makes annotation challenging. Results: We introduce annATAC, a novel method based on a language model for the automatic annotation of cell types in scATAC-seq data. The method consists of three stages. In the pre-training stage, the model is trained on a large amount of unlabeled data to learn the interactions between peaks, building a preliminary understanding of the data features. In the fine-tuning stage, a small amount of labeled data is used to train the model further so that it can identify cell types accurately. Finally, in the prediction stage, the trained model is applied to annotate scATAC-seq data. Conclusions: Compared with other automatic annotation methods across multiple datasets, annATAC shows superior annotation performance. Further experiments show that annATAC holds great potential for identifying marker peaks and marker motifs. annATAC is expected to enable more in-depth and precise analysis of scATAC-seq data and thereby support progress in related biomedical research.
ISSN: 1741-7007
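
Illustrative note: the three stages named in the summary (pre-training on unlabeled data, fine-tuning on a small labeled set, prediction) can be sketched on toy data. This is not the annATAC model. The language model is replaced here by a simple peak co-accessibility table and a nearest-centroid classifier, and every name (`simulate_cell`, `pretrain`, `embed`, `fine_tune`, `predict`) as well as the synthetic data are assumptions made only for illustration.

```python
import random

N_PEAKS = 40  # toy number of peaks (assumption)

def simulate_cell(cell_type, rng):
    """Sparse binary accessibility vector; each type has its own open block."""
    half = N_PEAKS // 2
    block = range(0, half) if cell_type == "A" else range(half, N_PEAKS)
    return [1 if rng.random() < (0.3 if i in block else 0.02) else 0
            for i in range(N_PEAKS)]

def pretrain(unlabeled):
    """Stage 1: estimate P(peak j open | peak i open) from unlabeled cells."""
    co = [[0.0] * N_PEAKS for _ in range(N_PEAKS)]
    for cell in unlabeled:
        open_peaks = [i for i, v in enumerate(cell) if v]
        for i in open_peaks:
            for j in open_peaks:
                co[i][j] += 1.0
    for i, row in enumerate(co):
        total = row[i]  # number of cells in which peak i is open
        if total:
            co[i] = [v / total for v in row]
    return co

def embed(cell, co):
    """Replace a sparse cell vector by the mean co-accessibility row of its open peaks."""
    open_peaks = [i for i, v in enumerate(cell) if v]
    if not open_peaks:
        return [0.0] * N_PEAKS
    return [sum(co[i][j] for i in open_peaks) / len(open_peaks)
            for j in range(N_PEAKS)]

def fine_tune(cells, labels, co):
    """Stage 2: compute one centroid per cell type from a few labeled cells."""
    sums, counts = {}, {}
    for cell, label in zip(cells, labels):
        vec = embed(cell, co)
        acc = sums.setdefault(label, [0.0] * N_PEAKS)
        for j, v in enumerate(vec):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def predict(cell, co, centroids):
    """Stage 3: assign the label of the nearest centroid (squared Euclidean distance)."""
    vec = embed(cell, co)
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[label]))
    return min(centroids, key=dist)

rng = random.Random(0)

# Stage 1: many unlabeled cells.
unlabeled = [simulate_cell(rng.choice("AB"), rng) for _ in range(300)]
co = pretrain(unlabeled)

# Stage 2: only ten labeled cells.
train_labels = ["A", "B"] * 5
train_cells = [simulate_cell(t, rng) for t in train_labels]
centroids = fine_tune(train_cells, train_labels, co)

# Stage 3: annotate held-out cells and measure accuracy.
test_labels = [rng.choice("AB") for _ in range(100)]
test_cells = [simulate_cell(t, rng) for t in test_labels]
predictions = [predict(c, co, centroids) for c in test_cells]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(f"toy accuracy: {accuracy:.2f}")
```

The point of the sketch is the division of labor: structure among peaks is learned without labels, so the supervised step only needs a handful of labeled cells. The published method should be consulted for the actual model architecture and training procedure.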