Data Augmentation and Machine Learning algorithms for multi-class imbalanced morphometrics data of stingless bees

Bibliographic Details
Main Authors: Daisy Salifu, Lorna Chepkemoi, Eric Ali Ibrahim, Kiatoko Nkoba, Henri E.Z. Tonnang
Format: Article
Language: English
Published: Elsevier 2025-02-01
Series: Heliyon
Subjects: Imbalanced data; Stingless bees; Random forest; SVM; SMOTE; ADASYN
Online Access: http://www.sciencedirect.com/science/article/pii/S2405844025005948
author Daisy Salifu
Lorna Chepkemoi
Eric Ali Ibrahim
Kiatoko Nkoba
Henri E.Z. Tonnang
collection DOAJ
description This study addresses multi-class imbalanced data in the classification of stingless bee samples using two data-balancing techniques: the Synthetic Minority Oversampling Technique (SMOTE) and the Adaptive Synthetic (ADASYN) approach. These techniques are combined with two machine learning (ML) algorithms, Random Forest (RF) and Support Vector Machine (SVM), to assess the models’ predictive performance in inferring the identity of stingless bee samples. We studied six ML classifier models (RF, RF + SMOTE, RF + ADASYN, SVM, SVM + SMOTE and SVM + ADASYN) on a six-class imbalanced dataset of stingless bee morphometrics. Multi-class area under the curve (AUC), F1-score, G-mean, balanced accuracy, sensitivity and the “No information rate” were used to assess model performance. SMOTE and ADASYN marginally improved the performance of the RF and SVM classifiers. SVM outperformed RF, and SVM with SMOTE performed better than SVM with ADASYN: SVM with ADASYN had a lower multi-class AUC (0.9898) and sensitivity (0.956) but a higher F1-score (0.939) than SVM with SMOTE (AUC = 0.9918, sensitivity = 0.959, F1-score = 0.934). Overall, SVM with SMOTE was superior to RF with SMOTE. All models except SVM with ADASYN correctly classified four of the six species, M. (Meliponula) bocandei, M. (Meliplebeia) lendliana, D. schmidti and P. armata, but not the two morphs, Meliponula (Axestotrigona) togoensis and Meliponula (Axestotrigona) ferruginea. The study therefore confirms that the impact of class imbalance on learning is minimal when classes are separable. Random forest recursive feature elimination was used to assess variable importance, which can guide future studies toward the key morphometric measurements, saving time and cost while maintaining high classification performance. Our results pave the way for the development of smart, automated machine learning applications to complement existing methods for identifying stingless bee species.
format Article
id doaj-art-ca66da1348954a2a84c886811d7df7d3
institution Kabale University
issn 2405-8440
language English
publishDate 2025-02-01
publisher Elsevier
record_format Article
series Heliyon
spelling doaj-art-ca66da1348954a2a84c886811d7df7d3 | 2025-01-29T05:01:32Z | eng | Elsevier | Heliyon | 2405-8440 | 2025-02-01 | vol. 11, no. 3, e42214
Authors: Daisy Salifu (corresponding author; P.O. Box 30772 – 00100, Nairobi, Kenya), Lorna Chepkemoi, Eric Ali Ibrahim, Kiatoko Nkoba, Henri E.Z. Tonnang
Affiliation (all authors): International Centre of Insect Physiology and Ecology (icipe), P.O. Box 30772, Nairobi, Kenya
title Data Augmentation and Machine Learning algorithms for multi-class imbalanced morphometrics data of stingless bees
topic Imbalanced data
Stingless bees
Random forest
SVM
SMOTE
ADASYN
url http://www.sciencedirect.com/science/article/pii/S2405844025005948
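
Illustrative note on the SMOTE technique named in the description: the sketch below shows the general idea of SMOTE-style oversampling, in which each synthetic sample is placed at a random position on the line segment between a minority-class sample and one of its nearest minority-class neighbours. It is a minimal pure-Python illustration under stated assumptions, not the authors' implementation. The function name smote_oversample, its parameters, and the sample values are hypothetical and are not taken from the article.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than exist
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical two-feature measurements for a rare class (not data from the study).
rare_class = [[1.0, 2.0], [1.2, 2.1], [0.9, 1.8], [1.1, 2.4]]
new_points = smote_oversample(rare_class, n_new=6, k=3)
```

Because every synthetic point is an interpolation between two existing minority samples, it stays inside the range already spanned by that class. ADASYN follows the same interpolation idea but generates more synthetic samples around minority points that have more majority-class neighbours.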