Feature-based ensemble modeling for addressing diabetes data imbalance using the SMOTE, RUS, and random forest methods: a prediction study

Bibliographic Details
Main Author: Younseo Jang
Format: Article
Language: English
Published: Ewha Womans University College of Medicine 2025-04-01
Series: The Ewha Medical Journal
Subjects: area under curve; computer neural networks; deep learning; diabetes mellitus; random forest
Online Access: http://www.e-emj.org/upload/pdf/emj-2025-00353.pdf
author Younseo Jang
collection DOAJ
description Purpose: This study developed and evaluated a feature-based ensemble model that combines the synthetic minority oversampling technique (SMOTE) and random undersampling (RUS) with a random forest approach to address class imbalance in machine learning for early diabetes detection, with the aim of improving predictive performance.
Methods: Using the scikit-learn diabetes dataset (442 samples, 10 features), we binarized the target variable (diabetes progression) at the 75th percentile and split the data 80:20 into training and test sets using stratified sampling. The training set was balanced to a 1:2 minority-to-majority ratio via SMOTE (0.6) and RUS (0.66). A feature-based ensemble model was constructed by training random forest classifiers on 10 two-feature subsets, selected based on feature importance, and combining their outputs using soft voting. Performance was compared against 13 baseline models, using accuracy and area under the curve (AUC) as metrics on the imbalanced test set.
Results: The feature-based ensemble model and the balanced random forest both achieved the highest accuracy (0.8764), followed by the fully connected neural network (0.8700); k-nearest neighbors had the lowest accuracy (0.8427). The ensemble model also had an excellent AUC (0.9227). Visualizations confirmed the ensemble model's superior discriminative ability, especially for the minority (high-risk) class, which is a critical factor in medical contexts.
Conclusion: Integrating SMOTE, RUS, and feature-based ensemble learning improved classification performance in imbalanced diabetes datasets by delivering robust accuracy and high recall for the minority class. This approach outperformed traditional resampling techniques and deep learning models, offering a scalable and interpretable solution for early diabetes prediction and potentially other medical applications.
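The soft-voting step described in the Methods can be illustrated with a minimal sketch. It is not the author's implementation. It assumes, purely for illustration, that the 10 two-feature subsets are all pairs drawn from the five most important features, and that each subset model outputs a probability for the high-risk class. The function names, the feature ranking, and the probabilities below are hypothetical.

```python
from itertools import combinations


def two_feature_subsets(ranked_features, top_k=5):
    """Return all unordered pairs among the top_k ranked features.

    Assumption: the record does not state how the 10 subsets were chosen;
    pairs of the top 5 features are one rule that yields exactly 10.
    """
    return list(combinations(ranked_features[:top_k], 2))


def soft_vote(probabilities, threshold=0.5):
    """Average each model's P(high risk) for one sample, then threshold the mean."""
    mean_p = sum(probabilities) / len(probabilities)
    return mean_p, int(mean_p >= threshold)


# Hypothetical importance ranking using scikit-learn diabetes feature names.
ranked = ["bmi", "s5", "bp", "s4", "s3"]
subsets = two_feature_subsets(ranked)

# Hypothetical P(high risk) from the 10 subset models for a single sample.
member_probs = [0.9, 0.7, 0.6, 0.8, 0.4, 0.3, 0.7, 0.6, 0.5, 0.5]
mean_p, label = soft_vote(member_probs)
print(len(subsets), round(mean_p, 2), label)
```

For a binary target, thresholding the mean positive-class probability at 0.5 is equivalent to taking the class with the larger averaged probability, which is what soft voting does.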
format Article
id doaj-art-70b25006cbe54e84848145dee521f4e5
institution Kabale University
issn 2234-3180
2234-2591
language English
publishDate 2025-04-01
publisher Ewha Womans University College of Medicine
record_format Article
series The Ewha Medical Journal
spelling doaj-art-70b25006cbe54e84848145dee521f4e5; 2025-08-26T00:04:46Z; eng; Ewha Womans University College of Medicine; The Ewha Medical Journal; 2234-3180; 2234-2591; 2025-04-01; 482; 10.12771/emj.2025.00353; 1620; Younseo Jang (College of Medicine, Ewha Womans University, Seoul, Korea); http://www.e-emj.org/upload/pdf/emj-2025-00353.pdf
title Feature-based ensemble modeling for addressing diabetes data imbalance using the SMOTE, RUS, and random forest methods: a prediction study
topic area under curve
computer neural networks
deep learning
diabetes mellitus
random forest
url http://www.e-emj.org/upload/pdf/emj-2025-00353.pdf