Estimation and validation of solubility of recombinant protein in E. coli strains via various advanced machine learning models

Bibliographic Details
Main Authors: Wael A. Mahdi, Adel Alhowyan, Ahmad J. Obaidullah
Format: Article
Language: English
Published: Nature Portfolio 2025-04-01
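Series: Scientific Reports
Subjects: Recombinant protein; Solubility; Decision tree regression; K-Nearest neighbors; Gaussian process regression
Online Access: https://doi.org/10.1038/s41598-025-97445-x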
collection DOAJ
description Abstract This study presents an approach to predicting the solubility of recombinant protein in four E. coli samples using machine learning techniques and optimization algorithms. Several models, including AdaBoost, Decision Tree Regression (DT), Gaussian Process Regression (GPR), and K-Nearest Neighbors (KNN), are applied to capture the relationships between experimental factors and protein solubility. Integrating these models within an AdaBoost framework, with hyperparameters tuned by the Firefly Algorithm (FA), is proposed as a strategy for improving predictive accuracy and model robustness. Preprocessing with Histogram-Based Outlier Detection (HBOD) and Z-score normalization is used to ensure data integrity and consistency. The FA uses 5-fold cross-validation as its fitness function to search the hyperparameter space, improving model performance across data partitions. The AdaBoost with Gaussian Process Regression (ADA-GPR) model proved superior to the alternatives, including ADA-DT and ADA-KNN, achieving high test R² scores and low mean squared error. With a standard deviation of 0.05188 across 5-fold cross-validation, ADA-GPR showed consistent performance and robust generalization across data partitions. Using hybrid optimization, the study identifies critical variables influencing protein solubility and offers a scalable solution for modeling bioprocesses.
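The evaluation protocol named in the abstract (Z-score normalization followed by 5-fold cross-validation, with the spread of fold scores reported as a standard deviation) can be illustrated with a minimal sketch. This is not the authors' pipeline: it swaps in a plain K-nearest-neighbors regressor for ADA-GPR, omits HBOD and the Firefly Algorithm, and runs on synthetic data. All function names, the feature formula, and the parameter values are hypothetical.

```python
import random
import statistics


def zscore_columns(rows):
    """Z-score normalize each feature column (mean 0, population std 1)."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def knn_predict(train_X, train_y, x, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(tx, x)), ty)
        for tx, ty in zip(train_X, train_y)
    )
    return statistics.fmean([ty for _, ty in dists[:k]])


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def cross_validate(X, y, n_folds=5, k=3, seed=0):
    """Return one R^2 score per fold of a shuffled n_folds-fold split."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        preds = [knn_predict(train_X, train_y, X[i], k) for i in fold]
        scores.append(r2_score([y[i] for i in fold], preds))
    return scores


# Synthetic stand-in data (assumption): two experimental factors, one response.
rng = random.Random(42)
raw_X = [[rng.uniform(0, 10), rng.uniform(0, 100)] for _ in range(100)]
y = [0.5 * a + 0.02 * b + rng.gauss(0, 0.1) for a, b in raw_X]

# Simplification: normalizing before splitting leaks fold statistics;
# a stricter setup would fit the scaler on each training fold only.
X = zscore_columns(raw_X)
scores = cross_validate(X, y)

print("R2 per fold:", [round(s, 3) for s in scores])
print("mean R2:", round(statistics.fmean(scores), 3))
print("std of R2 across folds:", round(statistics.stdev(scores), 3))
```

A hyperparameter search such as the Firefly Algorithm would call something like `cross_validate` repeatedly with candidate settings (here only `k`) and treat the mean fold score as the fitness value.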
id doaj-art-b6ebfedf44884e2ca90eb31bafd9fda7
institution OA Journals
issn 2045-2322
spelling doaj-art-b6ebfedf44884e2ca90eb31bafd9fda7
2025-08-20T02:24:29Z
eng
Nature Portfolio
Scientific Reports
2045-2322
2025-04-01
Volume 15, issue 1, pages 1-12
10.1038/s41598-025-97445-x
Estimation and validation of solubility of recombinant protein in E. coli strains via various advanced machine learning models
Wael A. Mahdi (Department of Pharmaceutics, College of Pharmacy, King Saud University)
Adel Alhowyan (Department of Pharmaceutics, College of Pharmacy, King Saud University)
Ahmad J. Obaidullah (Department of Pharmaceutical Chemistry, College of Pharmacy, King Saud University)
topic Recombinant protein
Solubility
Decision tree regression
K-Nearest neighbors
Gaussian process regression