An optimized data analytics pipeline for improving healthcare diagnosis using ensemble learning
Healthcare diagnosis is the process physicians follow before prescribing treatment to patients. Medical doctors may make an early prediction by observing physical signs and symptoms. Imposing a treatment without proper diagnosis cannot guarantee a cure and may sometimes lead the patient to a more detri...
Main Authors: | Lomat Haider Chowdhury, Shaira Tabassum, Swakkhar Shatabda, Ashir Ahmed |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier 2025-01-01 |
Series: | Informatics in Medicine Unlocked |
Subjects: | Healthcare diagnosis, Data analytics, Noise handling, Portable Health Clinic, COVID-19 |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2352914825000115 |
_version_ | 1823859404251856896 |
---|---|
author | Lomat Haider Chowdhury, Shaira Tabassum, Swakkhar Shatabda, Ashir Ahmed |
author_sort | Lomat Haider Chowdhury |
collection | DOAJ |
description | Healthcare diagnosis is the process physicians follow before prescribing treatment to patients. Medical doctors may make an early prediction by observing physical signs and symptoms. Imposing a treatment without proper diagnosis cannot guarantee a cure and may sometimes lead the patient to a more detrimental scenario. However, the cost of healthcare diagnosis makes people reluctant to go through the process. Big data and machine learning already contribute to the healthcare diagnosis sector, using available data that is growing enormously through the digitalization of the system. Yet the difficulty remains, since raw data contains noise, including missing values, outliers, and an imbalanced number of samples. These properties make it challenging to implement any diagnosis model. A complete patient profile cannot be generated when values are missing, which may affect the final prediction. Outliers in a medical dataset may represent extreme cases and rare conditions, or they may result from data entry errors. An excessive number of outliers may lead to a skewed and incorrect prediction. An imbalanced dataset makes it challenging to identify the minority classes appropriately and mostly yields a model biased toward majority class instances. A combination of advanced preprocessing techniques and reliable model selection is required to address these challenges effectively. This paper proposes a data analytics pipeline on a Portable Health Clinic (PHC) dataset. The paper systematically evaluates different preprocessing methods for missing value imputation, outlier detection, and data balancing and offers a comprehensive preprocessing framework. Five state-of-the-art ensemble models for healthcare diagnosis were then implemented along with a proposed ensemble machine learning model, KNN-XGBoost-SVM-Random Forest (KNN-X-SVM-R). 
The proposed model achieved an accuracy of 97.03%, which surpasses all the other state-of-the-art models. To further validate the model, we tested it on a second dataset of COVID-19 routine blood tests. In both cases, the proposed model achieved better results across different performance measures. Validating the approach on a secondary dataset strengthens the robustness of the proposed methodology. The recommended preprocessing and modeling approach can be adopted to enhance diagnostic systems and improve patient outcomes. |
format | Article |
id | doaj-art-7b16d736fecf4cd0b34c5b9d873e02c2 |
institution | Kabale University |
issn | 2352-9148 |
language | English |
publishDate | 2025-01-01 |
publisher | Elsevier |
record_format | Article |
series | Informatics in Medicine Unlocked |
spelling | doaj-art-7b16d736fecf4cd0b34c5b9d873e02c2; 2025-02-11T04:35:08Z; eng; Elsevier; Informatics in Medicine Unlocked; 2352-9148; 2025-01-01; 53; 101623; An optimized data analytics pipeline for improving healthcare diagnosis using ensemble learning; Lomat Haider Chowdhury (Ahsanullah University of Science and Technology, Dhaka, Bangladesh; corresponding author), Shaira Tabassum (Norwegian University of Science and Technology, Trondheim, Norway), Swakkhar Shatabda (BRAC University, Dhaka, Bangladesh), Ashir Ahmed (Kyushu University, Fukuoka, Japan); http://www.sciencedirect.com/science/article/pii/S2352914825000115; Healthcare diagnosis; Data analytics; Noise handling; Portable Health Clinic; COVID-19 |
title | An optimized data analytics pipeline for improving healthcare diagnosis using ensemble learning |
topic | Healthcare diagnosis, Data analytics, Noise handling, Portable Health Clinic, COVID-19 |
url | http://www.sciencedirect.com/science/article/pii/S2352914825000115 |
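The abstract in this record names three preprocessing stages (missing value imputation, outlier detection, data balancing) followed by an ensemble model, but the record does not say which specific methods the paper selected or how the KNN-X-SVM-R ensemble combines its members. As a rough illustration only, the sketch below uses generic stand-ins that are assumptions, not the authors' choices: median imputation, the interquartile-range rule, random oversampling, and hard majority voting. All function names and sample values are hypothetical.

```python
import random
import statistics
from collections import Counter


def impute_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in column]


def iqr_outlier_mask(column, k=1.5):
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(column, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [not (low <= v <= high) for v in column]


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def majority_vote(predictions):
    """Hard voting: return the label predicted by the most models.

    Ties go to the label that appears first in the input.
    """
    return Counter(predictions).most_common(1)[0][0]


# Made-up measurements for a single feature, with gaps and one extreme value.
glucose = [5.1, None, 5.4, 5.0, 25.0, 5.3, None, 5.2]
filled = impute_median(glucose)
flags = iqr_outlier_mask(filled)
print(filled)
print(flags)

# Made-up labels from four hypothetical base models for one patient.
print(majority_vote(["at risk", "healthy", "at risk", "at risk"]))
```

In practice each stage would more likely come from an established library, and whether a flagged outlier is dropped, clipped, or kept is a clinical judgment, since the abstract notes that outliers may be genuine rare conditions rather than data entry errors.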