Synthetic Data Generation and Evaluation Techniques for Classifiers in Data Starved Medical Applications

With their ability to find solutions among complex relationships of variables, machine learning (ML) techniques are becoming more applicable to various fields, including health risk prediction. However, prediction models are sensitive to the size and distribution of the data they are trained on. ML...

Bibliographic Details
Main Authors:	Wan D. Bae, Shayma Alkobaisi, Matthew Horak, Siddheshwari Bankar, Sartaj Bhuvaji, Sungroul Kim, Choon-Sik Park
Format:	Article
Language:	English
Published:	IEEE 2025-01-01
Series:	IEEE Access
Subjects:	Autoencoders; class imbalance problem; control coefficient; data starved contexts; rare event prediction; synthetic minority oversampling technique
Online Access:	https://ieeexplore.ieee.org/document/10847858/

_version_	1832576774144262144
author	Wan D. Bae, Shayma Alkobaisi, Matthew Horak, Siddheshwari Bankar, Sartaj Bhuvaji, Sungroul Kim, Choon-Sik Park
author_facet	Wan D. Bae, Shayma Alkobaisi, Matthew Horak, Siddheshwari Bankar, Sartaj Bhuvaji, Sungroul Kim, Choon-Sik Park
author_sort	Wan D. Bae
collection	DOAJ
description	With their ability to find solutions among complex relationships of variables, machine learning (ML) techniques are becoming more applicable to various fields, including health risk prediction. However, prediction models are sensitive to the size and distribution of the data they are trained on. ML algorithms rely heavily on vast quantities of training data to make accurate predictions. Ideally, the dataset should have an equal number of samples for each label to encourage the model to make predictions based on the input data rather than the distribution of the training data. In medical applications, class imbalance is a common issue because the occurrence of a disease or risk episode is often rare. This leads to a training dataset where healthy cases outnumber unhealthy ones, resulting in biased prediction models that struggle to detect the minority, unhealthy cases effectively. This paper addresses the problem of class imbalance, given the scarcity of training datasets, by improving the quality of generated data. We propose an incremental synthetic data generation system that improves data quality over iterations by gradually adjusting to the data distribution and thus avoids overfitting in classifiers. Through extensive experimental assessments on real asthma patients’ datasets, we demonstrate the efficiency and applicability of our proposed system for individual-based health risk prediction models. Incremental SMOTE methods were compared to the original SMOTE variants as well as various architectures of autoencoders. Our incremental data generation system enhances selected state-of-the-art SMOTE methods, resulting in sensitivity improvements for deep transfer learning (TL) classifiers ranging from 4.01% to 7.79%. Compared with the performance of TL without oversampling, the improvement achieved by the incremental SMOTE methods ranged from 27.18% to 40.97%. These results highlight the effectiveness of our technique in predicting asthma risk and its applicability to imbalanced, data-starved medical contexts.
format	Article
id	doaj-art-9b64f8c069834b8297e1e23c20841ecc
institution	Kabale University
issn	2169-3536
language	English
publishDate	2025-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj-art-9b64f8c069834b8297e1e23c20841ecc; 2025-01-31T00:01:13Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13, pp. 16584-16602; DOI 10.1109/ACCESS.2025.3532222; document 10847858; Synthetic Data Generation and Evaluation Techniques for Classifiers in Data Starved Medical Applications; Wan D. Bae [0] (https://orcid.org/0000-0002-4611-5546); Shayma Alkobaisi [1] (https://orcid.org/0000-0003-4237-7976); Matthew Horak [2] (https://orcid.org/0009-0008-3968-3626); Siddheshwari Bankar [3] (https://orcid.org/0009-0004-1613-3569); Sartaj Bhuvaji [4] (https://orcid.org/0009-0006-4594-7857); Sungroul Kim [5] (https://orcid.org/0000-0001-8726-9288); Choon-Sik Park [6] (https://orcid.org/0000-0001-7955-2526); [0] Department of Computer Science, Seattle University, Seattle, WA, USA; [1] College of Information Technology, United Arab Emirates University, Al Ain, United Arab Emirates; [2] Amazon AWS Lambda, Seattle, WA, USA; [3] Department of Computer Science, Seattle University, Seattle, WA, USA; [4] Department of Computer Science, Seattle University, Seattle, WA, USA; [5] Department of ICT Environmental Health System, Graduate School, Soonchunhyang University, Asan, South Korea; [6] Department of Internal Medicine, Soonchunhyang University Bucheon Hospital, Bucheon, South Korea; https://ieeexplore.ieee.org/document/10847858/; Autoencoders; class imbalance problem; control coefficient; data starved contexts; rare event prediction; synthetic minority oversampling technique
title	Synthetic Data Generation and Evaluation Techniques for Classifiers in Data Starved Medical Applications
topic	Autoencoders; class imbalance problem; control coefficient; data starved contexts; rare event prediction; synthetic minority oversampling technique
url	https://ieeexplore.ieee.org/document/10847858/

Similar Items