COVID-19 Data Analysis: The Impact of Missing Data Imputation on Supervised Learning Model Performance
The global COVID-19 pandemic has generated extensive datasets, providing opportunities to apply machine learning for diagnostic purposes. This study evaluates the performance of five supervised learning models—Random Forests (RFs), Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), L...
| Main Authors: | Jorge Daniel Mello-Román, Adrián Martínez-Amarilla |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-03-01 |
| Series: | Computation |
| Subjects: | supervised learning methods; COVID-19 diagnostics; data imputation; machine learning models |
| Online Access: | https://www.mdpi.com/2079-3197/13/3/70 |
| _version_ | 1849341839113453568 |
|---|---|
| author | Jorge Daniel Mello-Román; Adrián Martínez-Amarilla |
| author_facet | Jorge Daniel Mello-Román; Adrián Martínez-Amarilla |
| author_sort | Jorge Daniel Mello-Román |
| collection | DOAJ |
| description | The global COVID-19 pandemic has generated extensive datasets, providing opportunities to apply machine learning for diagnostic purposes. This study evaluates the performance of five supervised learning models—Random Forests (RFs), Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), Logistic Regression (LR), and Decision Trees (DTs)—on a hospital-based dataset from the Concepción Department in Paraguay. To address missing data, four imputation methods (Predictive Mean Matching via MICE, RF-based imputation, K-Nearest Neighbor, and XGBoost-based imputation) were tested. Model performance was compared using metrics such as accuracy, AUC, F1-score, and MCC across five levels of missingness. Overall, RF consistently achieved high accuracy and AUC at the highest missingness level, underscoring its robustness. In contrast, SVM often exhibited a trade-off between specificity and sensitivity. ANN and DT showed moderate resilience, yet were more prone to performance shifts under certain imputation approaches. These findings highlight RF’s adaptability to different imputation strategies, as well as the importance of selecting methods that minimize sensitivity–specificity trade-offs. By comparing multiple imputation techniques and supervised models, this study provides practical insights for handling missing medical data in resource-constrained settings and underscores the value of robust ensemble methods for reliable COVID-19 diagnostics. |
| format | Article |
| id | doaj-art-978b015bd3744bd693d7ae1a0d3b4e92 |
| institution | Kabale University |
| issn | 2079-3197 |
| language | English |
| publishDate | 2025-03-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Computation |
| spelling | doaj-art-978b015bd3744bd693d7ae1a0d3b4e92; 2025-08-20T03:43:33Z; eng; MDPI AG; Computation; 2079-3197; 2025-03-01; vol. 13, no. 3, art. 70; 10.3390/computation13030070; COVID-19 Data Analysis: The Impact of Missing Data Imputation on Supervised Learning Model Performance; Jorge Daniel Mello-Román (Faculty of Exact and Technological Sciences, Universidad Nacional de Concepción, Concepción 010123, Paraguay); Adrián Martínez-Amarilla (Faculty of Exact and Technological Sciences, Universidad Nacional de Concepción, Concepción 010123, Paraguay); The global COVID-19 pandemic has generated extensive datasets, providing opportunities to apply machine learning for diagnostic purposes. This study evaluates the performance of five supervised learning models—Random Forests (RFs), Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), Logistic Regression (LR), and Decision Trees (DTs)—on a hospital-based dataset from the Concepción Department in Paraguay. To address missing data, four imputation methods (Predictive Mean Matching via MICE, RF-based imputation, K-Nearest Neighbor, and XGBoost-based imputation) were tested. Model performance was compared using metrics such as accuracy, AUC, F1-score, and MCC across five levels of missingness. Overall, RF consistently achieved high accuracy and AUC at the highest missingness level, underscoring its robustness. In contrast, SVM often exhibited a trade-off between specificity and sensitivity. ANN and DT showed moderate resilience, yet were more prone to performance shifts under certain imputation approaches. These findings highlight RF’s adaptability to different imputation strategies, as well as the importance of selecting methods that minimize sensitivity–specificity trade-offs. By comparing multiple imputation techniques and supervised models, this study provides practical insights for handling missing medical data in resource-constrained settings and underscores the value of robust ensemble methods for reliable COVID-19 diagnostics.; https://www.mdpi.com/2079-3197/13/3/70; supervised learning methods; COVID-19 diagnostics; data imputation; machine learning models |
| spellingShingle | Jorge Daniel Mello-Román; Adrián Martínez-Amarilla; COVID-19 Data Analysis: The Impact of Missing Data Imputation on Supervised Learning Model Performance; Computation; supervised learning methods; COVID-19 diagnostics; data imputation; machine learning models |
| title | COVID-19 Data Analysis: The Impact of Missing Data Imputation on Supervised Learning Model Performance |
| title_sort | covid 19 data analysis the impact of missing data imputation on supervised learning model performance |
| topic | supervised learning methods; COVID-19 diagnostics; data imputation; machine learning models |
| url | https://www.mdpi.com/2079-3197/13/3/70 |
| work_keys_str_mv | AT jorgedanielmelloroman covid19dataanalysistheimpactofmissingdataimputationonsupervisedlearningmodelperformance AT adrianmartinezamarilla covid19dataanalysistheimpactofmissingdataimputationonsupervisedlearningmodelperformance |