Effective Techniques for Handling Missing Values in Thyroid Disease Diagnosis: A Comparative Analysis
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Wiley, 2025-01-01 |
| Series: | Applied Computational Intelligence and Soft Computing |
| Online Access: | http://dx.doi.org/10.1155/acis/2766701 |
| Summary: | Handling missing values is a critical challenge in thyroid disease prediction because it directly affects diagnostic accuracy. This study evaluates cold-deck, mean, and K-nearest neighbor (KNN) imputation for predicting thyroid disease using a dataset of 9172 observations with 31 clinical features (5.2% missing values). Feature importance analysis identified thyroid-stimulating hormone (TSH), total thyroxine (TT4), and free thyroxine index (FTI) as consistently significant biomarkers across all imputation methods. Five classifiers (Naïve Bayes, linear regression, support vector machines (SVM), LightGBM, and recurrent neural networks (RNN)) were assessed on the imputed datasets, with performance evaluated by accuracy, F1 score, and recall. With KNN imputation, LightGBM reached 99.06% accuracy, compared with 98.99% under mean imputation and 98.59% under cold-deck imputation (gains of 0.07 and 0.47 percentage points, respectively, as computed from the reported accuracies), which the authors attribute to KNN preserving relationships between features. LightGBM with KNN imputation achieved the highest performance overall (accuracy: 99.06%, F1: 97.57%, recall: 97.83%), outperforming the other classifiers by 2.5%–4.0% in accuracy. These results underscore the need for robust imputation in reliable thyroid disease prediction. The study provides a reproducible framework for managing missing data in healthcare analytics, emphasizing the interplay between imputation, feature importance, and classifier selection in optimizing diagnostic accuracy. |
|---|---|
| ISSN: | 1687-9732 |
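
The summary contrasts mean imputation with KNN imputation. The difference can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' implementation, which this record does not describe. The function names (`mean_impute`, `knn_impute`), the distance rescaling, and the toy TSH/TT4/FTI values are all assumptions made for illustration. In practice the features would likely be standardized before distances are computed.

```python
import math

def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]

def _distance(a, b):
    """Euclidean distance over columns observed in both rows.

    The squared sum is rescaled by (total columns / shared columns) so rows
    that share fewer columns are not favored; returns inf if none are shared.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))

def knn_impute(rows, k=2):
    """Replace each None with the column mean over the k nearest donor rows."""
    fallback = mean_impute(rows)  # used only if a cell has no usable donor
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other rows in which column j is observed.
            donors = [(_distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            filled[i][j] = (sum(nearest) / len(nearest)) if nearest else fallback[i][j]
    return filled

# Toy records (made-up values, NOT from the study's dataset): [TSH, TT4, FTI]
rows = [
    [1.0, 100.0, 105.0],
    [1.2, 110.0, 108.0],
    [9.0, 60.0, 55.0],
    [8.0, 64.0, 59.0],
    [8.5, None, 57.0],   # TT4 missing
]

mean_filled = mean_impute(rows)
knn_filled = knn_impute(rows, k=2)
print("mean imputation TT4:", mean_filled[4][1])
print("KNN imputation TT4: ", knn_filled[4][1])
```

In this toy case, mean imputation fills the missing TT4 with the average over all records. KNN imputation instead borrows from the two records whose TSH and FTI are closest, so the filled value stays consistent with the pattern of similar patients. This is the sense in which the summary describes KNN as preserving feature relationships.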