Improving Cardiovascular Disease Prediction through Stratified Machine Learning Models and Combined Datasets

Bibliographic Details
Main Authors: Tara Yousif Mawlood, Alla Ahmad Hassan, Rebwar Khalid Muhammed, Aso M. Aladdin, Tarik A. Rashid, Bryar A. Hassan
Format: Article
Language: English
Published: University of Human Development 2025-06-01
Series: UHD Journal of Science and Technology
Online Access: https://journals.uhd.edu.iq/index.php/uhdjst/article/view/1447
Description
Summary: The global rise in cardiovascular disease (CVD) cases underscores the need for accurate, early diagnosis. This study presents a machine learning (ML) framework for predicting CVD risk that combines two large datasets with identical features, covering clinical and biological indicators as well as patient history. Seven classification algorithms were evaluated: logistic regression, random forest (RF), support vector machine (SVM), Gaussian naive Bayes (GNB), gradient boosting (GB), K-nearest neighbors, and decision tree (DT). Stratified sampling was used to preserve the class distribution across data splits, and k-fold cross-validation was used to check robustness and generalizability. The datasets, sourced from the UCI repository, were pre-processed, and the models were evaluated using accuracy, precision, F1-score, log loss, and error rate, together with confusion matrices. The tree-based models RF and DT performed best, reaching 100% accuracy, while stratification significantly improved the results of SVM, GNB, and GB. The authors report that combining the datasets, stratified sampling, and k-fold validation improved model reliability while limiting overfitting. These findings highlight the potential of ML to support early CVD diagnosis and lay the groundwork for future research on hybrid models and real-world clinical applications.
ISSN: 2521-4209, 2521-4217
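
The evaluation protocol described in the summary (two feature-identical datasets combined, seven classifiers, stratified k-fold cross-validation, and metrics such as accuracy, precision, F1-score, log loss, and error rate) might look roughly like the sketch below. It is not the authors' code. The data are synthetic stand-ins generated with scikit-learn, and the fold count, feature count, scaling step, and all hyperparameters are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): one generated table is split into two
# "sources" with identical features, then recombined, to mimic the merging step.
X_all, y_all = make_classification(
    n_samples=400, n_features=13, n_informative=6, random_state=0
)
X_a, y_a = X_all[:250], y_all[:250]
X_b, y_b = X_all[250:], y_all[250:]
X = np.vstack([X_a, X_b])
y = np.concatenate([y_a, y_b])

# The seven classifier families named in the summary; hyperparameters are
# illustrative defaults. Scale-sensitive models get a StandardScaler (assumption).
models = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True, random_state=0)),
    "GNB": GaussianNB(),
    "GB": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "DT": DecisionTreeClassifier(random_state=0),
}

# Stratified k-fold keeps the class proportions of y in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "f1", "neg_log_loss"]

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    accuracy = scores["test_accuracy"].mean()
    results[name] = {
        "accuracy": accuracy,
        "precision": scores["test_precision"].mean(),
        "f1": scores["test_f1"].mean(),
        # scikit-learn reports log loss negated so that higher is better.
        "log_loss": -scores["test_neg_log_loss"].mean(),
        "error_rate": 1.0 - accuracy,
    }

for name, metrics in results.items():
    print(name, {k: round(float(v), 3) for k, v in metrics.items()})
```

Wrapping the scaler and the classifier in one pipeline means the scaler is fitted only on each training fold, which avoids leaking test-fold statistics into training. On real data, a cross-validated accuracy of exactly 100% would usually be worth checking for duplicate records shared between the combined datasets.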