ProtAlign-ARG: antibiotic resistance gene characterization integrating protein language models and alignment-based scoring


Bibliographic Details
Main Authors: Shafayat Ahmed, Muhit Islam Emon, Nazifa Ahmed Moumi, Lifu Huang, Dawei Zhou, Peter Vikesland, Amy Pruden, Liqing Zhang
Format: Article
Language: English
Published: Nature Portfolio 2025-08-01
Series: Scientific Reports
Subjects: ARG; Protein language model; Deep learning; Protein sequence
Online Access: https://doi.org/10.1038/s41598-025-14545-4
author Shafayat Ahmed
Muhit Islam Emon
Nazifa Ahmed Moumi
Lifu Huang
Dawei Zhou
Peter Vikesland
Amy Pruden
Liqing Zhang
collection DOAJ
description Abstract The evolution and spread of antibiotic resistance pose a global health challenge. Whole genome and metagenomic sequencing offer a promising approach to monitoring the spread, but typical alignment-based approaches for antibiotic resistance gene (ARG) detection are inherently limited in the ability to detect new variants. Large protein language models could present a powerful alternative but are limited by databases available for training. Here we introduce ProtAlign-ARG, a novel hybrid model combining a pre-trained protein language model and an alignment scoring-based model to expand the capacity for ARG detection from DNA sequencing data. ProtAlign-ARG learns from vast unannotated protein sequences, utilizing raw protein language model embeddings to improve the accuracy of ARG classification. In instances where the model lacks confidence, ProtAlign-ARG employs an alignment-based scoring method, incorporating bit scores and e-values to classify ARGs according to their corresponding classes of antibiotics. ProtAlign-ARG demonstrated remarkable accuracy in identifying and classifying ARGs, particularly excelling in recall compared to existing ARG identification and classification tools. We also extended ProtAlign-ARG to predict the functionality and mobility of ARGs, highlighting the model’s robustness in various predictive tasks. A comprehensive comparison of ProtAlign-ARG with both the alignment-based scoring model and the pre-trained protein language model demonstrated the superior performance of ProtAlign-ARG.
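The fallback behaviour described in the abstract (use the protein language model when it is confident, otherwise fall back to alignment bit scores and e-values) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the names `AlignmentHit` and `classify`, the confidence threshold of 0.9, and the e-value cutoff of 1e-10 are all assumptions introduced here, and the record does not state how the published model combines bit scores and e-values.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AlignmentHit:
    """One alignment hit against a reference ARG database (hypothetical type)."""
    arg_class: str    # antibiotic class of the reference sequence that was hit
    bit_score: float  # higher means a stronger alignment
    e_value: float    # lower means a more significant alignment


def classify(plm_probs: Dict[str, float],
             hits: List[AlignmentHit],
             confidence_threshold: float = 0.9,   # assumed value, not from the paper
             max_e_value: float = 1e-10) -> Optional[str]:  # assumed cutoff
    """Return an antibiotic class, or None if neither route yields an answer."""
    # Route 1: trust the language-model classifier when it is confident.
    if plm_probs:
        best_class = max(plm_probs, key=plm_probs.get)
        if plm_probs[best_class] >= confidence_threshold:
            return best_class

    # Route 2: fall back to alignment scoring. Keep only significant hits,
    # then take the class of the hit with the highest bit score.
    significant = [h for h in hits if h.e_value <= max_e_value]
    if not significant:
        return None
    return max(significant, key=lambda h: h.bit_score).arg_class


if __name__ == "__main__":
    probs = {"beta-lactam": 0.55, "tetracycline": 0.45}   # low-confidence prediction
    hits = [
        AlignmentHit("tetracycline", 210.0, 1e-50),
        AlignmentHit("beta-lactam", 80.0, 1e-12),
        AlignmentHit("aminoglycoside", 500.0, 1e-3),      # fails the e-value cutoff
    ]
    print(classify(probs, hits))
```

In this example the model's top probability (0.55) is below the assumed threshold, so the alignment route decides; the aminoglycoside hit is discarded for its weak e-value despite having the largest bit score.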
format Article
id doaj-art-48642fb28bbb4d9bbf89cca582aeaeb2
institution DOAJ
issn 2045-2322
language English
publishDate 2025-08-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-48642fb28bbb4d9bbf89cca582aeaeb2
2025-08-20T03:07:24Z
eng
Nature Portfolio
Scientific Reports
2045-2322
2025-08-01
Volume 15, issue 1, pages 1-13
10.1038/s41598-025-14545-4
ProtAlign-ARG: antibiotic resistance gene characterization integrating protein language models and alignment-based scoring
Shafayat Ahmed (Department of Computer Science, Virginia Polytechnic Institute and State University)
Muhit Islam Emon (Department of Computer Science, Virginia Polytechnic Institute and State University)
Nazifa Ahmed Moumi (Department of Computer Science, Virginia Polytechnic Institute and State University)
Lifu Huang (Department of Computer Science, Virginia Polytechnic Institute and State University)
Dawei Zhou (Department of Computer Science, Virginia Polytechnic Institute and State University)
Peter Vikesland (Department of Civil and Environmental Engineering, Virginia Polytechnic Institute and State University)
Amy Pruden (Department of Civil and Environmental Engineering, Virginia Polytechnic Institute and State University)
Liqing Zhang (Department of Computer Science, Virginia Polytechnic Institute and State University)
https://doi.org/10.1038/s41598-025-14545-4
ARG
Protein language model
Deep learning
Protein sequence
title ProtAlign-ARG: antibiotic resistance gene characterization integrating protein language models and alignment-based scoring
topic ARG
Protein language model
Deep learning
Protein sequence
url https://doi.org/10.1038/s41598-025-14545-4