Self-organizing maps to evaluate optimal strategies for balancing binary class distributions: a methodological approach
Abstract Since machine learning algorithms rely on data, the way datasets are collected significantly impacts their performance. Data must be carefully gathered to minimize missing values or class imbalance. However, the inherent nature of the data can sometimes lead to such imbalances. An unb...
| Main Authors: | Alberto Nogales, Diego Guadalupe, Álvaro J. García-Tejedor |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | SpringerOpen, 2025-06-01 |
| Series: | Journal of Big Data |
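| Subjects: | Unbalanced datasets; Balancing strategies; Artificial intelligence; Machine learning; Self-organizing map |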
| Online Access: | https://doi.org/10.1186/s40537-025-01188-5 |
| _version_ | 1849434278355533824 |
|---|---|
| author | Alberto Nogales; Diego Guadalupe; Álvaro J. García-Tejedor |
| author_sort | Alberto Nogales |
| collection | DOAJ |
| description | Abstract Since machine learning algorithms rely on data, the way datasets are collected significantly impacts their performance. Data must be carefully gathered to minimize missing values or class imbalance. However, the inherent nature of the data can sometimes lead to such imbalances. An unbalanced dataset can lead to biased models, where predictions are influenced by the majority class. To avoid this problem, balancing strategies can be applied to equalize the instances of each class. This paper introduces a methodological approach to evaluate which balancing strategies yield the best results depending on the dataset. We leverage self-organizing maps, an unsupervised neural network model, to identify which strategy generates the most suitable balanced synthetic data. By considering the topological structure of the data, we propose a metric that uses the trained map to measure changes between the original dataset and the transformed dataset after applying different strategies. This metric is based on the idea that synthetic data resembling the original dataset more closely is preferable. |
| format | Article |
| id | doaj-art-792a306bf27345ffb41d4b0eaac2b5fb |
| institution | Kabale University |
| issn | 2196-1115 |
| language | English |
| publishDate | 2025-06-01 |
| publisher | SpringerOpen |
| record_format | Article |
| series | Journal of Big Data |
| spelling | doaj-art-792a306bf27345ffb41d4b0eaac2b5fb; 2025-08-20T03:26:43Z; eng; SpringerOpen; Journal of Big Data; 2196-1115; 2025-06-01; 121132; 10.1186/s40537-025-01188-5; Self-organizing maps to evaluate optimal strategies for balancing binary class distributions: a methodological approach; Alberto Nogales [0]; Diego Guadalupe [1]; Álvaro J. García-Tejedor [2]; CEIEC, Research Institute, Universidad Francisco de Vitoria; CEIEC, Research Institute, Universidad Francisco de Vitoria; CEIEC, Research Institute, Universidad Francisco de Vitoria; Abstract Since machine learning algorithms rely on data, the way datasets are collected significantly impacts their performance. Data must be carefully gathered to minimize missing values or class imbalance. However, the inherent nature of the data can sometimes lead to such imbalances. An unbalanced dataset can lead to biased models, where predictions are influenced by the majority class. To avoid this problem, balancing strategies can be applied to equalize the instances of each class. This paper introduces a methodological approach to evaluate which balancing strategies yield the best results depending on the dataset. We leverage self-organizing maps, an unsupervised neural network model, to identify which strategy generates the most suitable balanced synthetic data. By considering the topological structure of the data, we propose a metric that uses the trained map to measure changes between the original dataset and the transformed dataset after applying different strategies. This metric is based on the idea that synthetic data resembling the original dataset more closely is preferable.; https://doi.org/10.1186/s40537-025-01188-5; Unbalanced datasets; Balancing strategies; Artificial intelligence; Machine learning; Self-organizing map |
| spellingShingle | Alberto Nogales; Diego Guadalupe; Álvaro J. García-Tejedor; Self-organizing maps to evaluate optimal strategies for balancing binary class distributions: a methodological approach; Journal of Big Data; Unbalanced datasets; Balancing strategies; Artificial intelligence; Machine learning; Self-organizing map |
| title | Self-organizing maps to evaluate optimal strategies for balancing binary class distributions: a methodological approach |
| title_sort | self organizing maps to evaluate optimal strategies for balancing binary class distributions a methodological approach |
| topic | Unbalanced datasets; Balancing strategies; Artificial intelligence; Machine learning; Self-organizing map |
| url | https://doi.org/10.1186/s40537-025-01188-5 |
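
The description field in this record outlines the idea of training a self-organizing map on the original data and then using the trained map to measure how much a balanced dataset departs from it. The record does not give the authors' metric, so the following is only a minimal sketch under stated assumptions: the small NumPy-only SOM, the function names `train_som`, `hit_distribution` and `map_shift`, all default parameters, and the choice of total variation distance between best-matching-unit hit distributions are illustrative stand-ins, not the method from the paper.

```python
import numpy as np


def train_som(data, rows=5, cols=5, epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small rectangular SOM with online updates.

    Returns a weight matrix of shape (rows * cols, n_features).
    All hyperparameters here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n_units = rows * cols
    # Initialise each unit's weight vector from a randomly chosen sample.
    weights = data[rng.integers(0, len(data), n_units)].astype(float)
    # Grid coordinates of every unit, used by the neighbourhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for idx in rng.permutation(len(data)):
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # linearly decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-3     # linearly shrinking neighbourhood
            x = data[idx]
            # Best-matching unit: the unit whose weights are closest to x.
            bmu = np.argmin(np.sum((weights - x) ** 2, axis=1))
            # Gaussian neighbourhood measured on the map grid, not in data space.
            grid_dist2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights


def hit_distribution(weights, data):
    """Fraction of samples whose best-matching unit is each map unit."""
    d2 = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    bmus = d2.argmin(axis=1)
    counts = np.bincount(bmus, minlength=len(weights))
    return counts / counts.sum()


def map_shift(weights, original, transformed):
    """Illustrative score (an assumption, not the paper's metric).

    Total variation distance between the two hit distributions on the
    trained map: 0 means identical occupancy, 1 means no shared units.
    """
    p = hit_distribution(weights, original)
    q = hit_distribution(weights, transformed)
    return 0.5 * np.abs(p - q).sum()
```

One way to use such a score, in the spirit of the abstract, would be to train the map once on the original data, compute `map_shift` between the original minority-class samples and the minority-class samples produced by each balancing strategy, and prefer the strategy with the lowest value. Comparing per class rather than over the whole dataset is a design choice made here because balancing changes the class proportions by construction.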