A new strategy for Cas protein recognition based on graph neural networks and SMILES encoding
Abstract The CRISPR-Cas system, an adaptive immune mechanism found in bacteria and archaea, has evolved into a promising genomic editing tool, with various types of Cas proteins playing a crucial role. In this study, we developed a set of strategies for mining and identifying Cas1 proteins. Firstly,...
| Main Authors: | Gaoxiang Chen, Liya Hou, Zhanwei Li, Bin Xie, Yongqiang Liu |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-04-01 |
| Series: | Scientific Reports |
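| Subjects: | CRISPR-associated protein 1 (Cas1); Graph neural networks (GNNs); Directed message passing neural networks (DMPNN); The simplified molecular-input line-entry system (SMILES) |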
| Online Access: | https://doi.org/10.1038/s41598-025-99999-2 |
| _version_ | 1850042952818098176 |
|---|---|
| author | Gaoxiang Chen Liya Hou Zhanwei Li Bin Xie Yongqiang Liu |
| author_facet | Gaoxiang Chen Liya Hou Zhanwei Li Bin Xie Yongqiang Liu |
| author_sort | Gaoxiang Chen |
| collection | DOAJ |
| description | Abstract The CRISPR-Cas system, an adaptive immune mechanism found in bacteria and archaea, has evolved into a promising genomic editing tool, with various types of Cas proteins playing a crucial role. In this study, we developed a set of strategies for mining and identifying Cas1 proteins. First, we analyzed in detail how 14 types of Cas proteins differ in the embedding space of a protein large language model; we then converted the proteins into the Simplified Molecular Input Line Entry System (SMILES) format and constructed graph data representing atom and bond features. Next, based on the characteristic differences among the Cas proteins, we designed and trained an ensemble model composed of two Directed Message Passing Neural Network (DMPNN) models for high-precision identification of Cas1 proteins. This ensemble model performed excellently on both training data and newly designed datasets. Comparison with other methods, such as CRISPRCasFinder, demonstrated its effectiveness. Finally, the ensemble model was successfully employed to identify potential Cas1 proteins in the Ensemble database, further highlighting its robustness and practicality. The strategies and models from this research may potentially be extended to other types of Cas proteins, though this would require further investigation and validation. Moreover, our work highlights SMILES encoding as a versatile tool for studying biological macromolecules, enabling efficient structural representation and advanced computational applications in protein research and beyond. |
| format | Article |
| id | doaj-art-64c4730cbc8a4d39a3246cdd3a1205ac |
| institution | DOAJ |
| issn | 2045-2322 |
| language | English |
| publishDate | 2025-04-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Scientific Reports |
| spelling | doaj-art-64c4730cbc8a4d39a3246cdd3a1205ac 2025-08-20T02:55:21Z eng Nature Portfolio Scientific Reports 2045-2322 2025-04-01 151121 10.1038/s41598-025-99999-2 A new strategy for Cas protein recognition based on graph neural networks and SMILES encoding Gaoxiang Chen 0 Liya Hou 1 Zhanwei Li 2 Bin Xie 3 Yongqiang Liu 4 Zhejiang Laboratory, Research Center for Life Sciences Computing Zhejiang Laboratory, Research Center for Life Sciences Computing Zhejiang Laboratory, Research Center for Life Sciences Computing Zhejiang Laboratory, Research Center for Life Sciences Computing Zhejiang Laboratory, Research Center for Life Sciences Computing Abstract The CRISPR-Cas system, an adaptive immune mechanism found in bacteria and archaea, has evolved into a promising genomic editing tool, with various types of Cas proteins playing a crucial role. In this study, we developed a set of strategies for mining and identifying Cas1 proteins. Firstly, we analyzed the characteristic differences of 14 types of Cas proteins in the protein large language model embedding space in detail; then converted proteins into the Simplified Molecular Input Line Entry System (SMILES) format, thereby constructing graph data representing atom and bond features. Next, based on the characteristic differences of different Cas proteins, we designed and trained an ensemble model composed of two Directed Message Passing Neural Network (DMPNN) models for high-precision identification of Cas1 proteins. This ensemble model performed excellently on both training data and newly designed datasets. The comparison of this method with other methods, such as CRISPRCasFinder, has demonstrated its effectiveness. Finally, the ensemble model was successfully employed to identify potential Cas1 proteins in the Ensemble database, further highlighting its robustness and practicality. The strategies and models from this research may potentially be extended to other types of Cas proteins, though this would require further investigation and validation. Moreover, our work highlights SMILES encoding as a versatile tool for studying biological macromolecules, enabling efficient structural representation and advanced computational applications in protein research and beyond. https://doi.org/10.1038/s41598-025-99999-2 CRISPR-associated protein 1 (Cas1) Graph neural networks (GNNs) Directed message passing neural networks (DMPNN) The simplified molecular-input line-entry system (SMILES) |
| spellingShingle | Gaoxiang Chen Liya Hou Zhanwei Li Bin Xie Yongqiang Liu A new strategy for Cas protein recognition based on graph neural networks and SMILES encoding Scientific Reports CRISPR-associated protein 1 (Cas1) Graph neural networks (GNNs) Directed message passing neural networks (DMPNN) The simplified molecular-input line-entry system (SMILES) |
| title | A new strategy for Cas protein recognition based on graph neural networks and SMILES encoding |
| title_full | A new strategy for Cas protein recognition based on graph neural networks and SMILES encoding |
| title_fullStr | A new strategy for Cas protein recognition based on graph neural networks and SMILES encoding |
| title_full_unstemmed | A new strategy for Cas protein recognition based on graph neural networks and SMILES encoding |
| title_short | A new strategy for Cas protein recognition based on graph neural networks and SMILES encoding |
| title_sort | new strategy for cas protein recognition based on graph neural networks and smiles encoding |
| topic | CRISPR-associated protein 1 (Cas1) Graph neural networks (GNNs) Directed message passing neural networks (DMPNN) The simplified molecular-input line-entry system (SMILES) |
| url | https://doi.org/10.1038/s41598-025-99999-2 |
| work_keys_str_mv | AT gaoxiangchen anewstrategyforcasproteinrecognitionbasedongraphneuralnetworksandsmilesencoding AT liyahou anewstrategyforcasproteinrecognitionbasedongraphneuralnetworksandsmilesencoding AT zhanweili anewstrategyforcasproteinrecognitionbasedongraphneuralnetworksandsmilesencoding AT binxie anewstrategyforcasproteinrecognitionbasedongraphneuralnetworksandsmilesencoding AT yongqiangliu anewstrategyforcasproteinrecognitionbasedongraphneuralnetworksandsmilesencoding AT gaoxiangchen newstrategyforcasproteinrecognitionbasedongraphneuralnetworksandsmilesencoding AT liyahou newstrategyforcasproteinrecognitionbasedongraphneuralnetworksandsmilesencoding AT zhanweili newstrategyforcasproteinrecognitionbasedongraphneuralnetworksandsmilesencoding AT binxie newstrategyforcasproteinrecognitionbasedongraphneuralnetworksandsmilesencoding AT yongqiangliu newstrategyforcasproteinrecognitionbasedongraphneuralnetworksandsmilesencoding |