Self-organizing maps to evaluate optimal strategies for balancing binary class distributions: a methodological approach

Abstract Since machine learning algorithms rely on data, the way datasets are collected significantly impacts their performance. Data must be carefully gathered to minimize missing values or class imbalance. However, the inherent nature of the data can sometimes lead to such imbalances. An unb...

Bibliographic Details
Main Authors: Alberto Nogales, Diego Guadalupe, Álvaro J. García-Tejedor
Format: Article
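Language: English
Published: SpringerOpen 2025-06-01
Series: Journal of Big Data
Subjects: Unbalanced datasets; Balancing strategies; Artificial intelligence; Machine learning; Self-organizing map
Online Access: https://doi.org/10.1186/s40537-025-01188-5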
author Alberto Nogales
Diego Guadalupe
Álvaro J. García-Tejedor
collection DOAJ
description Abstract Since machine learning algorithms rely on data, the way datasets are collected significantly impacts their performance. Data must be carefully gathered to minimize missing values or class imbalance. However, the inherent nature of the data can sometimes lead to such imbalances. An unbalanced dataset can lead to biased models, where predictions are influenced by the majority class. To avoid this problem, balancing strategies can be applied to equalize the instances of each class. This paper introduces a methodological approach to evaluate which balancing strategies yield the best results depending on the dataset. We leverage self-organizing maps, an unsupervised neural network model, to identify which strategy generates the most suitable balanced synthetic data. By considering the topological structure of the data, we propose a metric that uses the trained map to measure changes between the original dataset and the transformed dataset after applying different strategies. This metric is based on the idea that synthetic data resembling the original dataset more closely is preferable.
format Article
id doaj-art-792a306bf27345ffb41d4b0eaac2b5fb
institution Kabale University
issn 2196-1115
language English
publishDate 2025-06-01
publisher SpringerOpen
record_format Article
series Journal of Big Data
spelling doaj-art-792a306bf27345ffb41d4b0eaac2b5fb 2025-08-20T03:26:43Z eng SpringerOpen Journal of Big Data 2196-1115 2025-06-01 121132 10.1186/s40537-025-01188-5
Self-organizing maps to evaluate optimal strategies for balancing binary class distributions: a methodological approach
Alberto Nogales (CEIEC, Research Institute, Universidad Francisco de Vitoria)
Diego Guadalupe (CEIEC, Research Institute, Universidad Francisco de Vitoria)
Álvaro J. García-Tejedor (CEIEC, Research Institute, Universidad Francisco de Vitoria)
title Self-organizing maps to evaluate optimal strategies for balancing binary class distributions: a methodological approach
topic Unbalanced datasets
Balancing strategies
Artificial intelligence
Machine learning
Self-organizing map
url https://doi.org/10.1186/s40537-025-01188-5
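
The abstract in this record describes training a self-organizing map and using it to measure how far a balanced dataset drifts from the original one. The record does not give the metric itself, so the following is only a minimal sketch of the general idea under stated assumptions: a small NumPy SOM is trained on the original minority-class samples, and the "change" is taken to be the total variation distance between best-matching-unit hit distributions. The function names (`train_som`, `hit_distribution`, `map_shift`), the distance choice, the grid size, and the two toy "strategies" are all hypothetical and are not taken from the paper.

```python
import numpy as np

def train_som(data, rows=5, cols=5, epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small rectangular SOM; returns weights of shape (rows*cols, n_features)."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    # Initialise each unit's weight vector from a randomly chosen sample.
    weights = data[rng.integers(0, n, rows * cols)].astype(float)
    # Grid coordinates of every unit, used by the neighbourhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    total_steps = epochs * n
    step = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                # linearly decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-3   # shrinking neighbourhood radius
            # Best-matching unit: the unit whose weights are closest to the sample.
            bmu = np.argmin(((weights - data[i]) ** 2).sum(axis=1))
            grid_dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            # Pull the BMU and its grid neighbours towards the sample.
            weights += lr * h[:, None] * (data[i] - weights)
            step += 1
    return weights

def hit_distribution(weights, data):
    """Fraction of samples whose best-matching unit is each map unit."""
    d2 = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    counts = np.bincount(d2.argmin(axis=1), minlength=len(weights))
    return counts / counts.sum()

def map_shift(weights, original, transformed):
    """Hypothetical score: total variation distance between the two hit
    distributions (0 = identical occupancy of the map, 1 = disjoint)."""
    p = hit_distribution(weights, original)
    q = hit_distribution(weights, transformed)
    return 0.5 * np.abs(p - q).sum()

# Toy data: 40 original minority-class samples in two dimensions.
rng = np.random.default_rng(42)
minority = rng.normal(0.0, 1.0, size=(40, 2))
weights = train_som(minority)

# Two stand-in "strategies": jittered resampling of real points versus
# synthetic points drawn far away from the original data.
faithful = minority[rng.integers(0, 40, 200)] + rng.normal(0.0, 0.05, size=(200, 2))
careless = rng.uniform(5.0, 10.0, size=(200, 2))

faithful_shift = map_shift(weights, minority, faithful)
careless_shift = map_shift(weights, minority, careless)
print(f"faithful: {faithful_shift:.3f}  careless: {careless_shift:.3f}")
```

Under this assumed score, a lower value means the synthetic samples occupy the trained map in roughly the same way as the original ones, which matches the abstract's premise that synthetic data resembling the original dataset more closely is preferable. In practice the stand-in strategies would be replaced by real balancing methods, and the distance by whatever metric the article defines.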