Estimation and validation of solubility of recombinant protein in E. coli strains via various advanced machine learning models
Abstract This study presents a comprehensive approach to predicting solubility of recombinant protein in four E. coli samples by employing machine learning techniques and optimization algorithms. Various models, including AdaBoost, Decision Tree Regression (DT), Gaussian Process Regression (GPR), an...
| Main Authors: | Wael A. Mahdi, Adel Alhowyan, Ahmad J. Obaidullah |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-04-01 |
| Series: | Scientific Reports |
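| Subjects: | Recombinant protein; Solubility; Decision tree regression; K-Nearest neighbors; Gaussian process regression |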
| Online Access: | https://doi.org/10.1038/s41598-025-97445-x |
| _version_ | 1850156597628633088 |
|---|---|
| author | Wael A. Mahdi; Adel Alhowyan; Ahmad J. Obaidullah |
| author_facet | Wael A. Mahdi; Adel Alhowyan; Ahmad J. Obaidullah |
| author_sort | Wael A. Mahdi |
| collection | DOAJ |
| description | Abstract This study presents a comprehensive approach to predicting the solubility of recombinant protein in four E. coli samples using machine learning techniques and optimization algorithms. Several models, including AdaBoost, Decision Tree Regression (DT), Gaussian Process Regression (GPR), and K-Nearest Neighbors (KNN), are applied to capture the relationships between experimental factors and protein solubility. Integrating these models within an AdaBoost framework, coupled with hyperparameter tuning via the Firefly Algorithm (FA), is presented as a novel strategy for improving predictive accuracy and model robustness. Preprocessing with Histogram-Based Outlier Detection (HBOD) and Z-score normalization is used to ensure data integrity and consistency. The FA uses 5-fold cross-validation as its fitness function to search the hyperparameter space, improving model performance across data partitions. The AdaBoost with Gaussian Process Regression (ADA-GPR) model proved superior to the alternatives, including ADA-DT and ADA-KNN, achieving high test R² scores and low mean squared error. With a standard deviation of 0.05188 across 5-fold cross-validation, ADA-GPR showed consistent performance and robust generalization across data partitions. Using hybrid optimization, this study sheds light on critical variables influencing protein solubility, providing a scalable and effective solution for modeling bioprocesses. |
| format | Article |
| id | doaj-art-b6ebfedf44884e2ca90eb31bafd9fda7 |
| institution | OA Journals |
| issn | 2045-2322 |
| language | English |
| publishDate | 2025-04-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Scientific Reports |
| spelling | doaj-art-b6ebfedf44884e2ca90eb31bafd9fda7; 2025-08-20T02:24:29Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2025-04-01; 15; 1; 1; 12; 10.1038/s41598-025-97445-x; Estimation and validation of solubility of recombinant protein in E. coli strains via various advanced machine learning models; Wael A. Mahdi (0); Adel Alhowyan (1); Ahmad J. Obaidullah (2); (0) Department of Pharmaceutics, College of Pharmacy, King Saud University; (1) Department of Pharmaceutics, College of Pharmacy, King Saud University; (2) Department of Pharmaceutical Chemistry, College of Pharmacy, King Saud University; https://doi.org/10.1038/s41598-025-97445-x; Recombinant protein; Solubility; Decision tree regression; K-Nearest neighbors; Gaussian process regression |
| title | Estimation and validation of solubility of recombinant protein in E. coli strains via various advanced machine learning models |
| topic | Recombinant protein; Solubility; Decision tree regression; K-Nearest neighbors; Gaussian process regression |
| url | https://doi.org/10.1038/s41598-025-97445-x |
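
The modeling pipeline summarized in the abstract (Z-score normalization, AdaBoost over a Gaussian Process Regression base learner, and 5-fold cross-validation) might look roughly like the sketch below. This is an illustration using scikit-learn, not the authors' code: the data are synthetic stand-ins, the hyperparameters are fixed by hand rather than tuned with the Firefly Algorithm, and the outlier-detection step is omitted.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 60 runs, 3 experimental factors.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(60, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=60)

# GPR base learner. The WhiteKernel noise term keeps the kernel matrix
# well conditioned when AdaBoost's weighted bootstrap repeats samples.
base = GaussianProcessRegressor(
    kernel=RBF() + WhiteKernel(noise_level=0.1),
    normalize_y=True,
)

# Hyperparameters are hand-picked here (assumption), not tuned by FA.
# The base learner is passed positionally so the call works whether the
# installed scikit-learn names that argument `estimator` or `base_estimator`.
model = make_pipeline(
    StandardScaler(),  # Z-score normalization, fitted inside each fold
    AdaBoostRegressor(base, n_estimators=5, random_state=0),
)

# 5-fold cross-validation; the spread of the fold scores is the kind of
# consistency measure the abstract reports as a standard deviation.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print("R2 per fold:", np.round(scores, 3))
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Placing the scaler inside the pipeline means the normalization statistics are computed on each training fold only, which avoids leaking information from the held-out fold into the score.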