ProtAlign-ARG: antibiotic resistance gene characterization integrating protein language models and alignment-based scoring
Abstract The evolution and spread of antibiotic resistance pose a global health challenge. Whole genome and metagenomic sequencing offer a promising approach to monitoring the spread, but typical alignment-based approaches for antibiotic resistance gene (ARG) detection are inherently limited in the...
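| Main Authors: | Shafayat Ahmed, Muhit Islam Emon, Nazifa Ahmed Moumi, Lifu Huang, Dawei Zhou, Peter Vikesland, Amy Pruden, Liqing Zhang |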
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-08-01 |
| Series: | Scientific Reports |
| Subjects: | ARG; Protein language model; Deep learning; Protein sequence |
| Online Access: | https://doi.org/10.1038/s41598-025-14545-4 |
| _version_ | 1849735952196435968 |
|---|---|
| author | Shafayat Ahmed; Muhit Islam Emon; Nazifa Ahmed Moumi; Lifu Huang; Dawei Zhou; Peter Vikesland; Amy Pruden; Liqing Zhang |
| author_facet | Shafayat Ahmed; Muhit Islam Emon; Nazifa Ahmed Moumi; Lifu Huang; Dawei Zhou; Peter Vikesland; Amy Pruden; Liqing Zhang |
| author_sort | Shafayat Ahmed |
| collection | DOAJ |
| description | Abstract The evolution and spread of antibiotic resistance pose a global health challenge. Whole genome and metagenomic sequencing offer a promising approach to monitoring the spread, but typical alignment-based approaches for antibiotic resistance gene (ARG) detection are inherently limited in the ability to detect new variants. Large protein language models could present a powerful alternative but are limited by databases available for training. Here we introduce ProtAlign-ARG, a novel hybrid model combining a pre-trained protein language model and an alignment scoring-based model to expand the capacity for ARG detection from DNA sequencing data. ProtAlign-ARG learns from vast unannotated protein sequences, utilizing raw protein language model embeddings to improve the accuracy of ARG classification. In instances where the model lacks confidence, ProtAlign-ARG employs an alignment-based scoring method, incorporating bit scores and e-values to classify ARGs according to their corresponding classes of antibiotics. ProtAlign-ARG demonstrated remarkable accuracy in identifying and classifying ARGs, particularly excelling in recall compared to existing ARG identification and classification tools. We also extended ProtAlign-ARG to predict the functionality and mobility of ARGs, highlighting the model’s robustness in various predictive tasks. A comprehensive comparison of ProtAlign-ARG with both the alignment-based scoring model and the pre-trained protein language model demonstrated the superior performance of ProtAlign-ARG. |
| format | Article |
| id | doaj-art-48642fb28bbb4d9bbf89cca582aeaeb2 |
| institution | DOAJ |
| issn | 2045-2322 |
| language | English |
| publishDate | 2025-08-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Scientific Reports |
| spelling | doaj-art-48642fb28bbb4d9bbf89cca582aeaeb2; 2025-08-20T03:07:24Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2025-08-01; 151113; 10.1038/s41598-025-14545-4; ProtAlign-ARG: antibiotic resistance gene characterization integrating protein language models and alignment-based scoring; Shafayat Ahmed (0), Muhit Islam Emon (1), Nazifa Ahmed Moumi (2), Lifu Huang (3), Dawei Zhou (4), Peter Vikesland (5), Amy Pruden (6), Liqing Zhang (7); Affiliations: 0-4, 7: Department of Computer Science, Virginia Polytechnic Institute and State University; 5-6: Department of Civil and Environmental Engineering, Virginia Polytechnic Institute and State University; https://doi.org/10.1038/s41598-025-14545-4; ARG; Protein language model; Deep learning; Protein sequence |
| title | ProtAlign-ARG: antibiotic resistance gene characterization integrating protein language models and alignment-based scoring |
| title_sort | protalign arg antibiotic resistance gene characterization integrating protein language models and alignment based scoring |
| topic | ARG; Protein language model; Deep learning; Protein sequence |
| url | https://doi.org/10.1038/s41598-025-14545-4 |
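
The description field above outlines a two-stage decision: a protein language model classifies a sequence, and when the model lacks confidence an alignment-based score using bit scores and e-values assigns the antibiotic class instead. The record does not give the actual thresholds, tie-breaking rules, or data structures, so the following is only a minimal sketch of that confidence-gated fallback. Every name in it (`AlignmentHit`, `classify_hybrid`, `confidence_threshold`, `max_e_value`) and both default cutoffs are assumptions made for illustration, as is the choice to pick the significant hit with the highest bit score; none of this is taken from the ProtAlign-ARG implementation.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence, Tuple


@dataclass(frozen=True)
class AlignmentHit:
    """One alignment hit against a reference ARG database (hypothetical record)."""
    antibiotic_class: str
    bit_score: float
    e_value: float


def classify_hybrid(
    plm_probs: Mapping[str, float],
    hits: Sequence[AlignmentHit],
    confidence_threshold: float = 0.9,  # assumed cutoff, not from the paper
    max_e_value: float = 1e-10,         # assumed cutoff, not from the paper
) -> Tuple[Optional[str], str]:
    """Return (predicted class, source of the decision)."""
    # Stage 1: trust the language-model classifier when it is confident.
    best_class, best_prob = max(
        plm_probs.items(), key=lambda kv: kv[1], default=(None, 0.0)
    )
    if best_class is not None and best_prob >= confidence_threshold:
        return best_class, "plm"

    # Stage 2: fall back to alignment evidence. Keep only hits whose
    # e-value passes the cutoff, then take the highest bit score.
    significant = [h for h in hits if h.e_value <= max_e_value]
    if not significant:
        # Assumption: with no usable evidence, leave the sequence unlabelled.
        return None, "unclassified"
    top_hit = max(significant, key=lambda h: h.bit_score)
    return top_hit.antibiotic_class, "alignment"


if __name__ == "__main__":
    probs = {"beta-lactam": 0.55, "tetracycline": 0.45}
    hits = [
        AlignmentHit("tetracycline", bit_score=210.0, e_value=1e-60),
        AlignmentHit("beta-lactam", bit_score=95.0, e_value=1e-20),
    ]
    print(classify_hybrid(probs, hits))  # ('tetracycline', 'alignment')
```

In this sketch the language model's top probability (0.55) falls below the assumed threshold, so the decision is handed to the alignment hits, where the tetracycline hit has the highest bit score among those passing the e-value cutoff.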