XGBoost-enhanced ensemble model using discriminative hybrid features for the prediction of sumoylation sites

Abstract Posttranslational modifications (PTMs) are essential for regulating protein localization and stability, significantly affecting gene expression, biological functions, and genome replication. Among these, sumoylation, a PTM that attaches a chemical group to protein sequences, plays a critical...

Bibliographic Details
Main Authors: Salman Khan, Sumaiya Noor, Tahir Javed, Afshan Naseem, Fahad Aslam, Salman A. AlQahtani, Nijad Ahmad
Format: Article
Language: English
Published: BMC 2025-02-01
Series: BioData Mining
Subjects:
Online Access: https://doi.org/10.1186/s13040-024-00415-8
author Salman Khan
Sumaiya Noor
Tahir Javed
Afshan Naseem
Fahad Aslam
Salman A. AlQahtani
Nijad Ahmad
collection DOAJ
description Abstract Posttranslational modifications (PTMs) are essential for regulating protein localization and stability, significantly affecting gene expression, biological functions, and genome replication. Among these, sumoylation, a PTM that attaches a chemical group to protein sequences, plays a critical role in protein function. Identifying sumoylation sites is particularly important due to their links to Parkinson’s and Alzheimer’s. This study introduces XGBoost-Sumo, a robust model to predict sumoylation sites by integrating protein structure and sequence data. The model utilizes a transformer-based attention mechanism to encode peptides and extract evolutionary features through the PsePSSM-DWT approach. By fusing word embeddings with evolutionary descriptors, it applies the SHapley Additive exPlanations (SHAP) algorithm for optimal feature selection and uses eXtreme Gradient Boosting (XGBoost) for classification. XGBoost-Sumo achieved an impressive accuracy of 99.68% on benchmark datasets using 10-fold cross-validation and 96.08% on independent samples. This marks a significant improvement, outperforming existing models by 10.31% on training data and 2.74% on independent tests. The model’s reliability and high performance make it a valuable resource for researchers, with strong potential for applications in pharmaceutical development.
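The abstract names PsePSSM as the source of its evolutionary features and says these are fused with word embeddings. The record does not give the authors' formulas or code, so the following is only a minimal sketch of the commonly used pseudo position-specific score matrix (PsePSSM) construction, which turns a variable-length PSSM into a fixed-length vector. The function names `pse_pssm` and `fuse`, the `max_lag` parameter, and the toy matrix are illustrative assumptions; PSSM normalization, the DWT step, SHAP selection, and the XGBoost classifier are deliberately left out.

```python
from typing import List, Sequence


def pse_pssm(pssm: Sequence[Sequence[float]], max_lag: int = 2) -> List[float]:
    """Fixed-length PsePSSM vector for an L x C score matrix (hypothetical helper).

    Output length is C * (1 + max_lag): C column means, then C lagged
    mean-squared differences for each lag from 1 to max_lag.
    """
    length = len(pssm)
    if length <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    n_cols = len(pssm[0])

    # Part 1: average score of each column over all sequence positions.
    features = [sum(row[j] for row in pssm) / length for j in range(n_cols)]

    # Part 2: for each lag, the mean squared difference between scores of
    # positions that are `lag` residues apart (captures local sequence order).
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            total = sum(
                (pssm[i][j] - pssm[i + lag][j]) ** 2 for i in range(length - lag)
            )
            features.append(total / (length - lag))
    return features


def fuse(embedding: Sequence[float], evolutionary: Sequence[float]) -> List[float]:
    """Fuse two descriptors by simple concatenation (an assumed fusion rule)."""
    return list(embedding) + list(evolutionary)


# Toy 4-position, 2-column matrix; a real PSSM has 20 columns per position.
toy = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
features = pse_pssm(toy, max_lag=2)
fused = fuse([0.5], features)
```

For the toy matrix, `features` has 2 * (1 + 2) = 6 entries regardless of sequence length, which is what lets peptides of different lengths share one classifier input format.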
format Article
id doaj-art-af46290e133248f2ab4f09bd7be5149d
institution Kabale University
issn 1756-0381
language English
publishDate 2025-02-01
publisher BMC
record_format Article
series BioData Mining
spelling doaj-art-af46290e133248f2ab4f09bd7be5149d
2025-02-09T12:15:55Z
eng
BMC
BioData Mining
1756-0381
2025-02-01
181118
10.1186/s13040-024-00415-8
XGBoost-enhanced ensemble model using discriminative hybrid features for the prediction of sumoylation sites
Salman Khan: New Emerging Technologies and 5G Network and Beyond Research Chair, Department of Computer Engineering, College of Computer and Information Sciences, King Saud University
Sumaiya Noor: Business and Management Sciences Department, Purdue University
Tahir Javed: Department of Computer Science, Allama Iqbal Open University
Afshan Naseem: Institute of Oceanography and Environment (INOS), Universiti Malaysia Terengganu
Fahad Aslam: Institute of Oceanography and Environment (INOS), Universiti Malaysia Terengganu
Salman A. AlQahtani: New Emerging Technologies and 5G Network and Beyond Research Chair, Department of Computer Engineering, College of Computer and Information Sciences, King Saud University
Nijad Ahmad: Department of Computer Science, Khurasan University Jalalabad
https://doi.org/10.1186/s13040-024-00415-8
Pseudo position-specific score matrix
Sumoylation
Post-translation modification
XGBoost
SHAP
title XGBoost-enhanced ensemble model using discriminative hybrid features for the prediction of sumoylation sites
topic Pseudo position-specific score matrix
Sumoylation
Post-translation modification
XGBoost
SHAP
url https://doi.org/10.1186/s13040-024-00415-8