A new strategy for Cas protein recognition based on graph neural networks and SMILES encoding

Abstract The CRISPR-Cas system, an adaptive immune mechanism found in bacteria and archaea, has evolved into a promising genomic editing tool, with various types of Cas proteins playing a crucial role. In this study, we developed a set of strategies for mining and identifying Cas1 proteins. Firstly,...

Bibliographic Details
Main Authors: Gaoxiang Chen, Liya Hou, Zhanwei Li, Bin Xie, Yongqiang Liu
Format: Article
Language: English
Published: Nature Portfolio 2025-04-01
Series: Scientific Reports
Subjects:
Online Access: https://doi.org/10.1038/s41598-025-99999-2
collection DOAJ
description Abstract The CRISPR-Cas system, an adaptive immune mechanism found in bacteria and archaea, has evolved into a promising genomic editing tool, with various types of Cas proteins playing a crucial role. In this study, we developed a set of strategies for mining and identifying Cas1 proteins. First, we analyzed in detail how 14 types of Cas proteins differ in the embedding space of a protein large language model. We then converted the proteins into the Simplified Molecular Input Line Entry System (SMILES) format and constructed graph data representing atom and bond features. Next, based on the differences among Cas proteins, we designed and trained an ensemble of two Directed Message Passing Neural Network (DMPNN) models for high-precision identification of Cas1 proteins. The ensemble performed well on both the training data and newly designed datasets, and a comparison with other methods, such as CRISPRCasFinder, demonstrated its effectiveness. Finally, the ensemble model was used to identify potential Cas1 proteins in the Ensemble database, further highlighting its robustness and practicality. The strategies and models from this research may be extended to other types of Cas proteins, though this would require further investigation and validation. Moreover, our work highlights SMILES encoding as a versatile tool for studying biological macromolecules, enabling efficient structural representation and advanced computational applications in protein research and beyond.
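The abstract describes converting proteins to SMILES and then building graph data with atom and bond features as input to DMPNN models. This record does not state which toolkit or featurization the authors used, so the following is only a minimal, standard-library sketch of that general idea. The residue table `RESIDUE_SMILES`, the function names, and the decision to ignore stereochemistry, rings, and charges are all assumptions made for illustration, not the authors' pipeline.

```python
# Illustrative sketch only: a toy protein -> SMILES -> graph conversion.
# Assumptions (not from the article): a small hypothetical residue table,
# no stereochemistry, no ring or aromatic residues, heavy atoms only.

# Each fragment is backbone N, alpha carbon with side chain, carbonyl C(=O).
RESIDUE_SMILES = {
    "G": "NCC(=O)",
    "A": "NC(C)C(=O)",
    "S": "NC(CO)C(=O)",
    "C": "NC(CS)C(=O)",
    "V": "NC(C(C)C)C(=O)",
    "L": "NC(CC(C)C)C(=O)",
    "K": "NC(CCCCN)C(=O)",
    "D": "NC(CC(=O)O)C(=O)",
    "E": "NC(CCC(=O)O)C(=O)",
}


def peptide_to_smiles(sequence):
    """Concatenate residue fragments and close the C-terminus with a hydroxyl O."""
    return "".join(RESIDUE_SMILES[aa] for aa in sequence) + "O"


def smiles_to_graph(smiles):
    """Parse the restricted SMILES subset produced above.

    Returns (atoms, bonds): atoms is a list of element symbols, bonds is a
    list of (begin_index, end_index, bond_order) tuples.
    """
    atoms, bonds = [], []
    branch_stack = []   # atom index to return to when a branch closes
    prev = None         # index of the atom the next atom bonds to
    order = 1           # bond order for the next bond
    for ch in smiles:
        if ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch == "=":
            order = 2
        elif ch in "CNOS":
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx, order))
            prev = idx
            order = 1
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, bonds


def directed_edges(bonds):
    """Expand each bond into two directed edges, as directed message passing needs."""
    edges = []
    for i, j, bond_order in bonds:
        edges.append((i, j, bond_order))
        edges.append((j, i, bond_order))
    return edges
```

For the dipeptide Gly-Ala, `peptide_to_smiles("GA")` gives `NCC(=O)NC(C)C(=O)O`, which `smiles_to_graph` turns into 10 heavy atoms and 9 bonds. A real workflow would more likely rely on a cheminformatics toolkit for parsing and richer atom and bond features; that choice is not specified in this record.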
id doaj-art-64c4730cbc8a4d39a3246cdd3a1205ac
institution DOAJ
issn 2045-2322
spelling doaj-art-64c4730cbc8a4d39a3246cdd3a1205ac | 2025-08-20T02:55:21Z | eng | Nature Portfolio | Scientific Reports | 2045-2322 | 2025-04-01 | 151121 | 10.1038/s41598-025-99999-2 | A new strategy for Cas protein recognition based on graph neural networks and SMILES encoding | Gaoxiang Chen (0), Liya Hou (1), Zhanwei Li (2), Bin Xie (3), Yongqiang Liu (4) | Affiliations 0-4: Zhejiang Laboratory, Research Center for Life Sciences Computing | https://doi.org/10.1038/s41598-025-99999-2 | CRISPR-associated protein 1 (Cas1) | Graph neural networks (GNNs) | Directed message passing neural networks (DMPNN) | The simplified molecular-input line-entry system (SMILES)
topic CRISPR-associated protein 1 (Cas1)
Graph neural networks (GNNs)
Directed message passing neural networks (DMPNN)
The simplified molecular-input line-entry system (SMILES)