Comprehensive Analysis of Random Forest and XGBoost Performance with SMOTE, ADASYN, and GNUS Under Varying Imbalance Levels


Bibliographic Details
Main Authors: Mehdi Imani, Ali Beikmohammadi, Hamid Reza Arabnia
Format: Article
Language:English
Published: MDPI AG 2025-02-01
Series:Technologies
Subjects: customer churn prediction; different imbalance rates; XGBoost; random forest; upsampling techniques
Online Access:https://www.mdpi.com/2227-7080/13/3/88
author Mehdi Imani
Ali Beikmohammadi
Hamid Reza Arabnia
collection DOAJ
description This study examines the efficacy of Random Forest and XGBoost classifiers in conjunction with three upsampling techniques—SMOTE, ADASYN, and Gaussian noise upsampling (GNUS)—across datasets with varying class imbalance levels, ranging from moderate to extreme (15% to 1% churn rate). Employing metrics such as F1 score, ROC AUC, PR AUC, Matthews Correlation Coefficient (MCC), and Cohen’s Kappa, this research provides a comprehensive evaluation of classifier performance under different imbalance scenarios, focusing on applications in the telecommunications domain. The findings highlight that tuned XGBoost paired with SMOTE (Tuned_XGB_SMOTE) consistently achieves the highest F1 score and robust performance across all imbalance levels. SMOTE emerged as the most effective upsampling method, particularly when used with XGBoost, whereas Random Forest performed poorly under severe imbalance. ADASYN showed moderate effectiveness with XGBoost but underperformed with Random Forest, and GNUS produced inconsistent results. This study underscores the impact of data imbalance, with MCC, Kappa, and F1 scores fluctuating significantly, whereas ROC AUC and PR AUC remained relatively stable. Moreover, rigorous statistical analyses employing the Friedman test and Nemenyi post hoc comparisons confirmed that the observed improvements in F1 score, PR-AUC, Kappa, and MCC were statistically significant (p < 0.05), with Tuned_XGB_SMOTE significantly outperforming Tuned_RF_GNUS. While differences in ROC-AUC were not significant, the consistency of these results across multiple performance metrics underscores the reliability of our framework, offering a statistically validated and practical solution for model selection in imbalanced classification scenarios.
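The upsampling techniques compared in the abstract differ mainly in how they synthesize minority-class samples. As an illustration only (not the authors' code, which is not reproduced in this record), here is a minimal NumPy sketch of SMOTE's nearest-neighbour interpolation and of Gaussian noise upsampling; the function names and the `k` and `sigma` parameters are assumptions for the example:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    each chosen sample and one of its k nearest minority neighbours
    (the core idea of SMOTE)."""
    rng = np.random.default_rng(rng)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # exclude each sample from its own neighbours
    nn = np.argsort(d, axis=1)[:, :k]  # k nearest minority neighbours per sample
    base = rng.integers(0, len(X_min), size=n_new)
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

def gnus_sketch(X_min, n_new, sigma=0.1, rng=None):
    """Gaussian noise upsampling: duplicate random minority samples
    and jitter them with zero-mean Gaussian noise."""
    rng = np.random.default_rng(rng)
    idx = rng.integers(0, len(X_min), size=n_new)
    return X_min[idx] + rng.normal(0.0, sigma, size=(n_new, X_min.shape[1]))
```

Because SMOTE interpolates between real minority samples, its synthetic points stay on line segments inside the minority region, whereas GNUS merely jitters copies; that structural difference is one plausible reason for the inconsistent GNUS results the abstract reports. (ADASYN, not sketched here, additionally weights sampling toward harder-to-learn minority points.)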
format Article
id doaj-art-9df61f0f73d94dfcbcf97d59e0955c4e
institution Kabale University
issn 2227-7080
language English
publishDate 2025-02-01
publisher MDPI AG
record_format Article
series Technologies
doi 10.3390/technologies13030088
affiliation Mehdi Imani: Department of Computer and Systems Sciences, Stockholm University, SE-16455 Stockholm, Sweden
affiliation Ali Beikmohammadi: Department of Computer and Systems Sciences, Stockholm University, SE-16455 Stockholm, Sweden
affiliation Hamid Reza Arabnia: School of Computing, University of Georgia, Athens, GA 30602, USA
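Two of the evaluation metrics named in the abstract, Matthews Correlation Coefficient and Cohen's Kappa, are the ones reported to fluctuate most under imbalance. As a hedged illustration (not code from the paper), both can be computed directly from a binary confusion matrix:

```python
from math import sqrt

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def cohens_kappa(tp, fp, fn, tn):
    """Cohen's Kappa: observed accuracy corrected for chance agreement."""
    n = tp + fp + fn + tn
    po = (tp + tn) / n                                   # observed agreement
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2  # chance agreement
    return (po - pe) / (1 - pe)
```

Both metrics are 1 for a perfect classifier and 0 at chance level, so, unlike accuracy, they cannot be inflated by always predicting the majority class, which is why they expose imbalance effects that threshold-free ranking measures such as ROC AUC can mask.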
title Comprehensive Analysis of Random Forest and XGBoost Performance with SMOTE, ADASYN, and GNUS Under Varying Imbalance Levels
topic customer churn prediction
different imbalance rates
XGBoost
random forest
upsampling techniques
url https://www.mdpi.com/2227-7080/13/3/88