Optimizing Feature Selection and Machine Learning Algorithms for Early Detection of Prediabetes Risk: Comparative Study

Bibliographic Details
Main Authors: Mahmoud B Almadhoun, MA Burhanuddin
Format: Article
Language: English
Published: JMIR Publications 2025-07-01
Series: JMIR Bioinformatics and Biotechnology
Online Access: https://bioinform.jmir.org/2025/1/e70621
collection DOAJ
description Abstract. Background: Prediabetes is an intermediate stage between normal glucose metabolism and diabetes and is associated with an increased risk of complications such as cardiovascular disease and kidney failure. Objective: Recognizing individuals with prediabetes early is crucial so that timely interventions can slow or prevent the development of diabetes. This study aims to compare the effectiveness of machine learning (ML) algorithms in predicting prediabetes and to identify its key clinical predictors. Methods: Multiple ML models were evaluated, including random forest, extreme gradient boosting (XGBoost), support vector machine (SVM), and k-nearest neighbors (KNN). Results: Among the models tested, random forest generalized most robustly, with a cross-validated receiver operating characteristic area under the curve (ROC-AUC) of 0.9117. XGBoost followed closely, providing balanced accuracy in distinguishing between normal and prediabetic cases. SVM and KNN performed adequately as baseline models but showed limited sensitivity. Shapley additive explanations (SHAP) analysis indicated that BMI, age, high-density lipoprotein cholesterol, and low-density lipoprotein cholesterol were the key predictors across models. Hyperparameter tuning improved performance; for example, the ROC-AUC for SVM increased from 0.813 (default) to 0.863 (tuned). Principal component analysis (PCA) retained 12 components that preserved 95% of the variance in the dataset. Conclusions: This research demonstrates that optimized ML models, especially random forest and XGBoost, are effective tools for assessing early prediabetes risk. Combining SHAP analysis with least absolute shrinkage and selection operator (LASSO) and PCA enhances transparency, supporting the integration of these models into real-time clinical decision support systems. Future directions include validating the models in diverse clinical settings and integrating additional biomarkers to improve prediction accuracy, offering a promising avenue for early intervention and personalized treatment strategies in preventive health care.
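The description reports two quantities that are easy to misread: a cross-validated ROC-AUC and the number of PCA components needed to preserve 95% of the variance. The following is a minimal sketch of how such quantities can be computed, written in plain Python and not taken from the study. The function names (`roc_auc`, `cross_validated_auc`, `components_for_variance`) and all toy numbers are illustrative assumptions; the record does not say which software or folds the authors used.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the probability that a random positive case
    receives a higher score than a random negative case (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def cross_validated_auc(
    fold_labels: Sequence[Sequence[int]],
    fold_scores: Sequence[Sequence[float]],
) -> float:
    """Mean of the per-fold ROC-AUC values (one entry per held-out fold)."""
    aucs = [roc_auc(y, s) for y, s in zip(fold_labels, fold_scores)]
    return sum(aucs) / len(aucs)


def components_for_variance(
    explained_ratios: Sequence[float], threshold: float = 0.95
) -> int:
    """Smallest number of leading components whose cumulative
    explained-variance ratio reaches the threshold."""
    total = 0.0
    for k, ratio in enumerate(explained_ratios, start=1):
        total += ratio
        if total >= threshold:
            return k
    return len(explained_ratios)


if __name__ == "__main__":
    # Toy held-out predictions for two folds (hypothetical, not study data).
    y_folds = [[0, 1], [0, 0, 1, 1]]
    s_folds = [[0.2, 0.9], [0.1, 0.4, 0.35, 0.8]]
    print(cross_validated_auc(y_folds, s_folds))  # 0.875

    # Toy explained-variance ratios, sorted from largest to smallest.
    print(components_for_variance([0.5, 0.3, 0.16, 0.04]))  # 3
```

The pairwise definition of ROC-AUC is quadratic in the number of cases, which is fine for illustration; a rank-based formulation is the usual choice for larger datasets.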
format Article
id doaj-art-d561e75d2a8f4d02bcd88fd26c9d1871
institution Kabale University
issn 2563-3570
language English
publishDate 2025-07-01
publisher JMIR Publications
record_format Article
series JMIR Bioinformatics and Biotechnology
spelling doaj-art-d561e75d2a8f4d02bcd88fd26c9d1871; 2025-08-20T04:01:09Z; eng; JMIR Publications; JMIR Bioinformatics and Biotechnology; 2563-3570; 2025-07-01; 6; e70621; 10.2196/70621; Optimizing Feature Selection and Machine Learning Algorithms for Early Detection of Prediabetes Risk: Comparative Study; Mahmoud B Almadhoun, http://orcid.org/0009-0001-3734-8735; MA Burhanuddin, http://orcid.org/0000-0001-8976-7416; https://bioinform.jmir.org/2025/1/e70621
title Optimizing Feature Selection and Machine Learning Algorithms for Early Detection of Prediabetes Risk: Comparative Study
url https://bioinform.jmir.org/2025/1/e70621