An Enhanced Tree Ensemble for Classification in the Presence of Extreme Class Imbalance
Researchers using machine learning methods for classification can face challenges due to class imbalance, where a certain class is underrepresented. Over- or under-sampling of minority or majority class observations, or relying solely on model selection for ensemble methods, may prove ineffective whe...
| Main Authors: | Samir K. Safi, Sheema Gul |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2024-10-01 |
| Series: | Mathematics |
| Subjects: | random forest; tree selection; classification; class-imbalance problem; synthetic data generation |
| Online Access: | https://www.mdpi.com/2227-7390/12/20/3243 |
| _version_ | 1850206132054786048 |
|---|---|
| author | Samir K. Safi; Sheema Gul |
| author_facet | Samir K. Safi; Sheema Gul |
| author_sort | Samir K. Safi |
| collection | DOAJ |
| description | Researchers using machine learning methods for classification can face challenges due to class imbalance, where a certain class is underrepresented. Over- or under-sampling of minority or majority class observations, or relying solely on model selection for ensemble methods, may prove ineffective when the class imbalance ratio is extremely high. To address this issue, this paper proposes a method called enhanced tree ensemble (ETE), which generates synthetic data for minority class observations and combines it with tree selection based on the trees' performance on the training data. The proposed method first generates minority class instances to balance the training data and then selects trees using either out-of-bag observations ($\mathrm{ETE}_{\mathrm{OOB}}$) or sub-sample observations ($\mathrm{ETE}_{\mathrm{SS}}$). The efficacy of the proposed method is assessed on twenty benchmark problems for binary classification with moderate to extreme class imbalance, comparing it against other well-known methods: optimal tree ensemble (OTE), SMOTE random forest ($\mathrm{RF}_{\mathrm{SMOTE}}$), oversampling random forest ($\mathrm{RF}_{\mathrm{OS}}$), under-sampling random forest ($\mathrm{RF}_{\mathrm{US}}$), k-nearest neighbor (k-NN), support vector machine (SVM), tree, and artificial neural network (ANN). Classification error rate and precision are used as performance metrics. The analyses revealed that the proposed method, based on data balancing and model selection, yielded better results than the other methods. |
| format | Article |
| id | doaj-art-bca6dae2a2ad4eb093099be39b7dbe0a |
| institution | OA Journals |
| issn | 2227-7390 |
| language | English |
| publishDate | 2024-10-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Mathematics |
| spelling | doaj-art-bca6dae2a2ad4eb093099be39b7dbe0a; 2025-08-20T02:10:56Z; eng; MDPI AG; Mathematics; 2227-7390; 2024-10-01; 12; 20; 3243; 10.3390/math12203243; An Enhanced Tree Ensemble for Classification in the Presence of Extreme Class Imbalance; Samir K. Safi (0); Sheema Gul (1); Department of Statistics and Business Analytics, College of Business and Economics, United Arab Emirates University, Al Ain P.O. Box 15551, United Arab Emirates; Department of Statistics and Business Analytics, College of Business and Economics, United Arab Emirates University, Al Ain P.O. Box 15551, United Arab Emirates; Researchers using machine learning methods for classification can face challenges due to class imbalance, where a certain class is underrepresented. Over- or under-sampling of minority or majority class observations, or relying solely on model selection for ensemble methods, may prove ineffective when the class imbalance ratio is extremely high. To address this issue, this paper proposes a method called enhanced tree ensemble (ETE), which generates synthetic data for minority class observations and combines it with tree selection based on the trees' performance on the training data. The proposed method first generates minority class instances to balance the training data and then selects trees using either out-of-bag observations ($\mathrm{ETE}_{\mathrm{OOB}}$) or sub-sample observations ($\mathrm{ETE}_{\mathrm{SS}}$). The efficacy of the proposed method is assessed on twenty benchmark problems for binary classification with moderate to extreme class imbalance, comparing it against other well-known methods: optimal tree ensemble (OTE), SMOTE random forest ($\mathrm{RF}_{\mathrm{SMOTE}}$), oversampling random forest ($\mathrm{RF}_{\mathrm{OS}}$), under-sampling random forest ($\mathrm{RF}_{\mathrm{US}}$), k-nearest neighbor (k-NN), support vector machine (SVM), tree, and artificial neural network (ANN). Classification error rate and precision are used as performance metrics. The analyses revealed that the proposed method, based on data balancing and model selection, yielded better results than the other methods.; https://www.mdpi.com/2227-7390/12/20/3243; random forest; tree selection; classification; class-imbalance problem; synthetic data generation |
| spellingShingle | Samir K. Safi; Sheema Gul; An Enhanced Tree Ensemble for Classification in the Presence of Extreme Class Imbalance; Mathematics; random forest; tree selection; classification; class-imbalance problem; synthetic data generation |
| title | An Enhanced Tree Ensemble for Classification in the Presence of Extreme Class Imbalance |
| title_full | An Enhanced Tree Ensemble for Classification in the Presence of Extreme Class Imbalance |
| title_fullStr | An Enhanced Tree Ensemble for Classification in the Presence of Extreme Class Imbalance |
| title_full_unstemmed | An Enhanced Tree Ensemble for Classification in the Presence of Extreme Class Imbalance |
| title_short | An Enhanced Tree Ensemble for Classification in the Presence of Extreme Class Imbalance |
| title_sort | enhanced tree ensemble for classification in the presence of extreme class imbalance |
| topic | random forest; tree selection; classification; class-imbalance problem; synthetic data generation |
| url | https://www.mdpi.com/2227-7390/12/20/3243 |
| work_keys_str_mv | AT samirksafi anenhancedtreeensembleforclassificationinthepresenceofextremeclassimbalance AT sheemagul anenhancedtreeensembleforclassificationinthepresenceofextremeclassimbalance AT samirksafi enhancedtreeensembleforclassificationinthepresenceofextremeclassimbalance AT sheemagul enhancedtreeensembleforclassificationinthepresenceofextremeclassimbalance |
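
The description field above outlines a two-step idea: balance the training data with synthetic minority instances, then keep only the trees that perform well on out-of-bag observations. The following is a minimal sketch of that idea, not the authors' implementation. Everything specific in it is an assumption made for illustration: the SMOTE-style interpolation used as the synthetic generator, depth-1 decision stumps standing in for full trees (to keep the example dependency-free), and the hypothetical names and parameters `balance_with_synthetic`, `fit_selected_stumps`, `k`, `n_trees`, and `n_keep`.

```python
import random

def balance_with_synthetic(X, y, minority=1, k=3, seed=0):
    """Append interpolated minority points until both classes have equal size.

    Assumption: SMOTE-style interpolation between a minority point and one of
    its k nearest minority neighbours; needs at least two minority points.
    """
    rng = random.Random(seed)
    pts = [x for x, label in zip(X, y) if label == minority]
    n_new = max(len(y) - 2 * len(pts), 0)  # majority count minus minority count
    X_bal, y_bal = [list(x) for x in X], list(y)
    for _ in range(n_new):
        a = rng.choice(pts)
        near = sorted((p for p in pts if p is not a),
                      key=lambda p: sum((u - v) ** 2 for u, v in zip(a, p)))[:k]
        b, t = rng.choice(near), rng.random()
        X_bal.append([u + t * (v - u) for u, v in zip(a, b)])
        y_bal.append(minority)
    return X_bal, y_bal

def fit_stump(X, y):
    """Depth-1 tree: (feature, threshold, left_label) with lowest training error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for left in (0, 1):
                err = sum((left if x[j] <= thr else 1 - left) != label
                          for x, label in zip(X, y))
                if best is None or err < best[0]:
                    best = (err, j, thr, left)
    return best[1:]

def predict_stump(stump, x):
    j, thr, left = stump
    return left if x[j] <= thr else 1 - left

def fit_selected_stumps(X, y, n_trees=50, n_keep=10, seed=0):
    """Grow stumps on bootstrap samples; keep the n_keep with lowest OOB error."""
    rng = random.Random(seed)
    n, scored = len(X), []
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]
        in_bag = set(bag)
        oob = [i for i in range(n) if i not in in_bag]
        if not oob:
            continue  # no out-of-bag points to score this stump on
        stump = fit_stump([X[i] for i in bag], [y[i] for i in bag])
        oob_err = sum(predict_stump(stump, X[i]) != y[i] for i in oob) / len(oob)
        scored.append((oob_err, stump))
    scored.sort(key=lambda pair: pair[0])
    return [stump for _, stump in scored[:n_keep]]

def predict(ensemble, x):
    """Majority vote over the selected stumps (ties go to class 0)."""
    return int(2 * sum(predict_stump(s, x) for s in ensemble) > len(ensemble))

def error_rate(y_true, y_pred):
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fp) if tp + fp else 0.0

# Toy data (hypothetical): 100 majority points in the unit square, 5 minority points near (3, 3).
X = [[i / 100, (i * 7 % 100) / 100] for i in range(100)]
X += [[3 + i / 10, 3 + (i * 3 % 5) / 10] for i in range(5)]
y = [0] * 100 + [1] * 5

X_bal, y_bal = balance_with_synthetic(X, y)
ensemble = fit_selected_stumps(X_bal, y_bal)
y_pred = [predict(ensemble, x) for x in X]
print("error rate:", error_rate(y, y_pred), "precision:", precision(y, y_pred))
```

The two metrics at the end mirror those named in the description (classification error rate and precision). The sub-sample variant mentioned there would score each tree on a held-out subset of the balanced data instead of its out-of-bag points; it is not shown here.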