Comprehensive Analysis of Random Forest and XGBoost Performance with SMOTE, ADASYN, and GNUS Under Varying Imbalance Levels
This study examines the efficacy of Random Forest and XGBoost classifiers in conjunction with three upsampling techniques—SMOTE, ADASYN, and Gaussian noise upsampling (GNUS)—across datasets with varying class imbalance levels, ranging from moderate to extreme (15% to 1% churn rate). Employing metrics such as F1 score, ROC AUC, PR AUC, Matthews Correlation Coefficient (MCC), and Cohen’s Kappa, this research provides a comprehensive evaluation of classifier performance under different imbalance scenarios, focusing on applications in the telecommunications domain. The findings highlight that tuned XGBoost paired with SMOTE (Tuned_XGB_SMOTE) consistently achieves the highest F1 score and robust performance across all imbalance levels. SMOTE emerged as the most effective upsampling method, particularly when used with XGBoost, whereas Random Forest performed poorly under severe imbalance. ADASYN showed moderate effectiveness with XGBoost but underperformed with Random Forest, and GNUS produced inconsistent results. This study underscores the impact of data imbalance, with MCC, Kappa, and F1 scores fluctuating significantly, whereas ROC AUC and PR AUC remained relatively stable. Moreover, rigorous statistical analyses employing the Friedman test and Nemenyi post hoc comparisons confirmed that the observed improvements in F1 score, PR-AUC, Kappa, and MCC were statistically significant (<i>p</i> < 0.05), with Tuned_XGB_SMOTE significantly outperforming Tuned_RF_GNUS. While differences in ROC-AUC were not significant, the consistency of these results across multiple performance metrics underscores the reliability of our framework, offering a statistically validated and attractive solution for model selection in imbalanced classification scenarios.
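Of the three upsampling techniques the abstract names, Gaussian noise upsampling (GNUS) is the least standardized. A minimal NumPy sketch of the general idea follows — minority-class rows are duplicated and perturbed with feature-wise Gaussian noise until the classes are balanced. This is an illustration only, not the authors' implementation; the 1:1 balancing target and the `noise_scale` default are assumptions.

```python
import numpy as np

def gaussian_noise_upsample(X, y, minority_label=1, noise_scale=0.05, random_state=0):
    """Upsample the minority class by copying its rows and adding
    zero-mean Gaussian noise to each copy (one common reading of GNUS).

    Noise std per feature is noise_scale times that feature's std
    within the minority class, so the perturbation respects scale.
    """
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_min = X[y == minority_label]
    # Number of synthetic rows needed to reach a 1:1 class ratio.
    n_needed = int((y != minority_label).sum() - len(X_min))
    idx = rng.integers(0, len(X_min), size=n_needed)
    sigma = noise_scale * X_min.std(axis=0)
    X_new = X_min[idx] + rng.normal(0.0, 1.0, size=(n_needed, X.shape[1])) * sigma
    X_res = np.vstack([X, X_new])
    y_res = np.concatenate([y, np.full(n_needed, minority_label)])
    return X_res, y_res
```

In contrast to SMOTE and ADASYN, which interpolate between minority neighbors, this scheme only jitters existing minority points, which may explain the inconsistent results the abstract reports for GNUS.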
| Main Authors: | Mehdi Imani, Ali Beikmohammadi, Hamid Reza Arabnia |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-02-01 |
| Series: | Technologies |
| Subjects: | customer churn prediction; different imbalance rates; XGBoost; random forest; upsampling techniques |
| Online Access: | https://www.mdpi.com/2227-7080/13/3/88 |
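The record's evaluation leans on F1, MCC, and Cohen's Kappa — the three metrics the abstract says fluctuate most under imbalance. As a self-contained illustration (not the paper's code), all three can be computed directly from binary confusion-matrix counts:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """F1 score, Matthews Correlation Coefficient, and Cohen's Kappa
    for binary labels, derived from the confusion-matrix counts."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))

    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    n = tp + tn + fp + fn
    p_o = (tp + tn) / n  # observed agreement (accuracy)
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2  # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    return f1, mcc, kappa
```

Unlike ROC AUC and PR AUC, which the abstract notes stay relatively stable, these three metrics depend on a hard decision threshold, which is why they react strongly to the class ratio.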
| _version_ | 1849339792121135104 |
|---|---|
| author | Mehdi Imani; Ali Beikmohammadi; Hamid Reza Arabnia |
| author_facet | Mehdi Imani; Ali Beikmohammadi; Hamid Reza Arabnia |
| author_sort | Mehdi Imani |
| collection | DOAJ |
| description | This study examines the efficacy of Random Forest and XGBoost classifiers in conjunction with three upsampling techniques—SMOTE, ADASYN, and Gaussian noise upsampling (GNUS)—across datasets with varying class imbalance levels, ranging from moderate to extreme (15% to 1% churn rate). Employing metrics such as F1 score, ROC AUC, PR AUC, Matthews Correlation Coefficient (MCC), and Cohen’s Kappa, this research provides a comprehensive evaluation of classifier performance under different imbalance scenarios, focusing on applications in the telecommunications domain. The findings highlight that tuned XGBoost paired with SMOTE (Tuned_XGB_SMOTE) consistently achieves the highest F1 score and robust performance across all imbalance levels. SMOTE emerged as the most effective upsampling method, particularly when used with XGBoost, whereas Random Forest performed poorly under severe imbalance. ADASYN showed moderate effectiveness with XGBoost but underperformed with Random Forest, and GNUS produced inconsistent results. This study underscores the impact of data imbalance, with MCC, Kappa, and F1 scores fluctuating significantly, whereas ROC AUC and PR AUC remained relatively stable. Moreover, rigorous statistical analyses employing the Friedman test and Nemenyi post hoc comparisons confirmed that the observed improvements in F1 score, PR-AUC, Kappa, and MCC were statistically significant (<i>p</i> < 0.05), with Tuned_XGB_SMOTE significantly outperforming Tuned_RF_GNUS. While differences in ROC-AUC were not significant, the consistency of these results across multiple performance metrics underscores the reliability of our framework, offering a statistically validated and attractive solution for model selection in imbalanced classification scenarios. |
| format | Article |
| id | doaj-art-9df61f0f73d94dfcbcf97d59e0955c4e |
| institution | Kabale University |
| issn | 2227-7080 |
| language | English |
| publishDate | 2025-02-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Technologies |
| spelling | doaj-art-9df61f0f73d94dfcbcf97d59e0955c4e (2025-08-20T03:44:03Z); eng; MDPI AG; Technologies (2227-7080); 2025-02-01; Vol. 13, Iss. 3, Art. 88; doi:10.3390/technologies13030088; Comprehensive Analysis of Random Forest and XGBoost Performance with SMOTE, ADASYN, and GNUS Under Varying Imbalance Levels; Mehdi Imani (Department of Computer and Systems Sciences, Stockholm University, SE-16455 Stockholm, Sweden); Ali Beikmohammadi (Department of Computer and Systems Sciences, Stockholm University, SE-16455 Stockholm, Sweden); Hamid Reza Arabnia (School of Computing, University of Georgia, Athens, GA 30602, USA); https://www.mdpi.com/2227-7080/13/3/88; customer churn prediction; different imbalance rates; XGBoost; random forest; upsampling techniques |
| title | Comprehensive Analysis of Random Forest and XGBoost Performance with SMOTE, ADASYN, and GNUS Under Varying Imbalance Levels |
| topic | customer churn prediction; different imbalance rates; XGBoost; random forest; upsampling techniques |
| url | https://www.mdpi.com/2227-7080/13/3/88 |
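The abstract reports significance via the Friedman test with Nemenyi post hoc comparisons across pipelines. A minimal SciPy sketch of the Friedman step is below; the scores and pipeline names are made up for illustration (they do not reproduce the paper's results), and the Nemenyi comparison is summarized here only by the average ranks it would operate on.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Hypothetical F1 scores: rows = datasets (imbalance levels), columns = pipelines.
scores = np.array([
    [0.62, 0.55, 0.58],   # 15% churn
    [0.51, 0.43, 0.47],   # 5% churn
    [0.38, 0.27, 0.31],   # 1% churn
])
pipelines = ["XGB_SMOTE", "RF_GNUS", "XGB_ADASYN"]

# Friedman test: one sample (column of `scores`) per pipeline.
stat, p = friedmanchisquare(*scores.T)

# Average rank per pipeline (rank 1 = best, so rank the negated scores);
# Nemenyi then compares differences in these ranks to a critical distance.
avg_ranks = rankdata(-scores, axis=1).mean(axis=0)
for name, r in zip(pipelines, avg_ranks):
    print(f"{name}: mean rank {r:.2f}")
print(f"Friedman chi2 = {stat:.3f}, p = {p:.4f}")
```

The Friedman test is rank-based and makes no normality assumption, which suits comparing a fixed set of pipelines across the same datasets, as done in the paper.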