Feature-based ensemble modeling for addressing diabetes data imbalance using the SMOTE, RUS, and random forest methods: a prediction study
Purpose: This study developed and evaluated a feature-based ensemble model integrating the synthetic minority oversampling technique (SMOTE) and random undersampling (RUS) methods with a random forest approach to address class imbalance in machine learning for early diabetes detection, aiming to impr...
| Main Author: | Younseo Jang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Ewha Womans University College of Medicine, 2025-04-01 |
| Series: | The Ewha Medical Journal |
| Subjects: | area under curve; computer neural networks; deep learning; diabetes mellitus; random forest |
| Online Access: | http://www.e-emj.org/upload/pdf/emj-2025-00353.pdf |
| _version_ | 1849223020552388608 |
|---|---|
| author | Younseo Jang |
| author_facet | Younseo Jang |
| author_sort | Younseo Jang |
| collection | DOAJ |
| description | Purpose: This study developed and evaluated a feature-based ensemble model that integrates the synthetic minority oversampling technique (SMOTE) and random undersampling (RUS) with a random forest approach to address class imbalance in machine learning for early diabetes detection, with the aim of improving predictive performance. Methods: Using the Scikit-learn diabetes dataset (442 samples, 10 features), we binarized the target variable (diabetes progression) at the 75th percentile and split the data 80:20 using stratified sampling. The training set was balanced to a 1:2 minority-to-majority ratio via SMOTE (0.6) and RUS (0.66). A feature-based ensemble model was constructed by training random forest classifiers on 10 two-feature subsets, selected based on feature importance, and combining their outputs using soft voting. Performance was compared against 13 baseline models on the imbalanced test set, using accuracy and area under the curve (AUC) as metrics. Results: The feature-based ensemble model and the balanced random forest both achieved the highest accuracy (0.8764), followed by the fully connected neural network (0.8700). The ensemble model had an excellent AUC (0.9227), while k-nearest neighbors had the lowest accuracy (0.8427). Visualizations confirmed the ensemble model's superior discriminative ability, especially for the minority (high-risk) class, which is a critical factor in medical contexts. Conclusion: Integrating SMOTE, RUS, and feature-based ensemble learning improved classification performance in imbalanced diabetes datasets by delivering robust accuracy and high recall for the minority class. This approach outperforms traditional resampling techniques and deep learning models, offering a scalable and interpretable solution for early diabetes prediction and potentially other medical applications. |
| format | Article |
| id | doaj-art-70b25006cbe54e84848145dee521f4e5 |
| institution | Kabale University |
| issn | 2234-3180 2234-2591 |
| language | English |
| publishDate | 2025-04-01 |
| publisher | Ewha Womans University College of Medicine |
| record_format | Article |
| series | The Ewha Medical Journal |
| spelling | doaj-art-70b25006cbe54e84848145dee521f4e5; 2025-08-26T00:04:46Z; eng; Ewha Womans University College of Medicine; The Ewha Medical Journal; 2234-3180; 2234-2591; 2025-04-01; 48; 2; 10.12771/emj.2025.00353; 1620; Feature-based ensemble modeling for addressing diabetes data imbalance using the SMOTE, RUS, and random forest methods: a prediction study; Younseo Jang (College of Medicine, Ewha Womans University, Seoul, Korea); http://www.e-emj.org/upload/pdf/emj-2025-00353.pdf; area under curve; computer neural networks; deep learning; diabetes mellitus; random forest |
| title | Feature-based ensemble modeling for addressing diabetes data imbalance using the SMOTE, RUS, and random forest methods: a prediction study |
| topic | area under curve computer neural networks deep learning diabetes mellitus random forest |
| url | http://www.e-emj.org/upload/pdf/emj-2025-00353.pdf |
| work_keys_str_mv | AT younseojang featurebasedensemblemodelingforaddressingdiabetesdataimbalanceusingthesmoterusandrandomforestmethodsapredictionstudy |
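
The description above mentions random forest classifiers trained on 10 two-feature subsets and combined by soft voting. The following is a minimal standard-library sketch of those two ideas only, not the study's implementation. Several things here are assumptions: the importance scores are invented for illustration, the helper names (`top_feature_pairs`, `soft_vote`, `to_labels`) are hypothetical, and pairing the five highest-ranked features is just one way to arrive at exactly 10 pairs, since the abstract does not say how the subsets were chosen. The per-model probabilities stand in for the outputs of fitted classifiers.

```python
from itertools import combinations

# Feature names of the scikit-learn diabetes dataset. The importance scores
# are made-up illustrative numbers, NOT values reported in the study.
IMPORTANCE = {
    "bmi": 0.25, "s5": 0.22, "bp": 0.14, "s6": 0.09, "s3": 0.08,
    "s2": 0.06, "s1": 0.06, "age": 0.05, "s4": 0.03, "sex": 0.02,
}


def top_feature_pairs(importance, k=5):
    """Return all two-feature pairs among the k highest-scoring features.

    Assumption: the abstract does not describe the selection rule; pairing
    the top 5 features is simply one way to obtain exactly 10 subsets.
    """
    ranked = sorted(importance, key=importance.get, reverse=True)[:k]
    return list(combinations(ranked, 2))


def soft_vote(per_model_probs):
    """Average the positive-class probabilities across models, per sample."""
    n_models = len(per_model_probs)
    return [sum(column) / n_models for column in zip(*per_model_probs)]


def to_labels(mean_probs, threshold=0.5):
    """Turn averaged probabilities into 0/1 labels (1 = high-risk class)."""
    return [1 if p >= threshold else 0 for p in mean_probs]


pairs = top_feature_pairs(IMPORTANCE)

# Hypothetical positive-class probabilities from three pair-specific
# classifiers for two samples (rows = models, columns = samples).
probs = [
    [0.9, 0.2],
    [0.6, 0.4],
    [0.3, 0.3],
]
mean_probs = soft_vote(probs)
labels = to_labels(mean_probs)

print(len(pairs), pairs[0])
print(labels)
```

For a binary problem, averaging the positive-class probability and thresholding at 0.5 is equivalent to picking the class with the highest mean probability, which is what soft voting does. A lower threshold could be used when recall on the minority class matters more than overall accuracy.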