An Enhanced Tree Ensemble for Classification in the Presence of Extreme Class Imbalance

Researchers using machine learning methods for classification can face challenges due to class imbalance, where a certain class is underrepresented. Over or under-sampling of minority or majority class observations, or solely relying on model selection for ensemble methods, may prove ineffective whe...

Bibliographic Details
Main Authors: Samir K. Safi, Sheema Gul
Format: Article
Language: English
Published: MDPI AG 2024-10-01
Series: Mathematics
Subjects:
Online Access: https://www.mdpi.com/2227-7390/12/20/3243
collection DOAJ
description Researchers using machine learning methods for classification can face challenges due to class imbalance, where a certain class is underrepresented. Over- or under-sampling of minority or majority class observations, or relying solely on model selection for ensemble methods, may prove ineffective when the class imbalance ratio is extremely high. To address this issue, this paper proposes a method called enhanced tree ensemble (ETE), which combines synthetic data generation for minority class observations with tree selection based on the trees' performance on the training data. The proposed method first generates minority class instances to balance the training data and then selects trees using out-of-bag ($\mathrm{ETE}_{\mathrm{OOB}}$) and sub-sample ($\mathrm{ETE}_{\mathrm{SS}}$) observations, respectively. The efficacy of the proposed method is assessed on twenty benchmark problems for binary classification with moderate to extreme class imbalance, comparing it against other well-known methods: optimal tree ensemble (OTE), SMOTE random forest ($RF_{SMOTE}$), oversampling random forest ($\mathrm{RF}_{\mathrm{OS}}$), under-sampling random forest ($\mathrm{RF}_{\mathrm{US}}$), k-nearest neighbor (k-NN), support vector machine (SVM), tree, and artificial neural network (ANN). Classification error rate and precision are used as performance metrics. The analyses revealed that the proposed method, based on data balancing and model selection, yielded better results than the other methods.
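The two-stage idea in the description (balance the training data with synthetic minority instances, then keep only the trees that score best on out-of-bag observations) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the record does not say how the synthetic instances are generated, so SMOTE-style interpolation between minority neighbours is assumed; one-split decision stumps stand in for full trees; and the names `oversample_minority`, `fit_selected_ensemble`, and the parameters `k`, `n_trees`, and `keep` are hypothetical.

```python
import random

def oversample_minority(X, y, minority=1, k=3, rng=None):
    """Add interpolated minority points until both classes have equal size.
    ASSUMPTION: SMOTE-style interpolation; the record does not name the generator."""
    rng = rng or random.Random(0)
    minority_pts = [x for x, lab in zip(X, y) if lab == minority]
    n_needed = (len(y) - len(minority_pts)) - len(minority_pts)
    X_new, y_new = list(X), list(y)
    for _ in range(max(n_needed, 0)):
        a = rng.choice(minority_pts)
        # k nearest original minority neighbours of a (excluding a itself)
        neighbours = sorted(
            (p for p in minority_pts if p is not a),
            key=lambda p: sum((u - v) ** 2 for u, v in zip(a, p)),
        )[:k]
        b = rng.choice(neighbours)
        t = rng.random()
        X_new.append([u + t * (v - u) for u, v in zip(a, b)])
        y_new.append(minority)
    return X_new, y_new

def fit_stump(X, y, rng):
    """Fit a one-split tree on one randomly chosen feature (stand-in for a full tree)."""
    f = rng.randrange(len(X[0]))
    best = None
    for thr in sorted({x[f] for x in X}):
        for left_label in (0, 1):
            errors = sum(
                (left_label if x[f] <= thr else 1 - left_label) != lab
                for x, lab in zip(X, y)
            )
            if best is None or errors < best[0]:
                best = (errors, f, thr, left_label)
    return best[1:]

def predict_stump(stump, x):
    f, thr, left_label = stump
    return left_label if x[f] <= thr else 1 - left_label

def fit_selected_ensemble(X, y, n_trees=50, keep=0.2, seed=0):
    """Balance the data, grow bootstrap stumps, keep the best by out-of-bag error."""
    rng = random.Random(seed)
    X_bal, y_bal = oversample_minority(X, y, rng=rng)
    n = len(y_bal)
    scored = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(idx)
        oob = [i for i in range(n) if i not in in_bag]
        stump = fit_stump([X_bal[i] for i in idx], [y_bal[i] for i in idx], rng)
        oob_err = sum(predict_stump(stump, X_bal[i]) != y_bal[i] for i in oob) / max(len(oob), 1)
        scored.append((oob_err, stump))
    scored.sort(key=lambda s: s[0])  # lowest out-of-bag error first
    return [stump for _, stump in scored[: max(1, int(keep * n_trees))]]

def predict(ensemble, x):
    """Majority vote over the selected stumps."""
    votes = sum(predict_stump(s, x) for s in ensemble)
    return int(2 * votes > len(ensemble))

def error_rate_and_precision(y_true, y_pred, positive=1):
    errors = sum(t != p for t, p in zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    return errors / len(y_true), precision

# Toy data with a 95:5 class ratio (illustrative only, not a benchmark from the paper).
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(95)]
X += [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(5)]
y = [0] * 95 + [1] * 5
ensemble = fit_selected_ensemble(X, y)
preds = [predict(ensemble, x) for x in X]
err, prec = error_rate_and_precision(y, preds)
print(f"selected trees: {len(ensemble)}, error rate: {err:.3f}, precision: {prec:.3f}")
```

A sub-sample variant in the spirit of $\mathrm{ETE}_{\mathrm{SS}}$ would presumably score each tree on a held-out subset of the balanced training data rather than on its out-of-bag rows; that detail is not given in this record, so it is left out of the sketch.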
id doaj-art-bca6dae2a2ad4eb093099be39b7dbe0a
institution OA Journals
issn 2227-7390
spelling doaj-art-bca6dae2a2ad4eb093099be39b7dbe0a
2025-08-20T02:10:56Z
eng
MDPI AG
Mathematics
2227-7390
2024-10-01
Volume 12, Issue 20, Article 3243
10.3390/math12203243
An Enhanced Tree Ensemble for Classification in the Presence of Extreme Class Imbalance
Samir K. Safi, Department of Statistics and Business Analytics, College of Business and Economics, United Arab Emirates University, Al Ain P.O. Box 15551, United Arab Emirates
Sheema Gul, Department of Statistics and Business Analytics, College of Business and Economics, United Arab Emirates University, Al Ain P.O. Box 15551, United Arab Emirates
https://www.mdpi.com/2227-7390/12/20/3243
topic random forest
tree selection
classification
class-imbalance problem
synthetic data generation