Comparative analysis of heart disease prediction using logistic regression, SVM, KNN, and random forest with cross-validation for improved accuracy

Abstract This primary research paper emphasizes cross-validation, where data samples are reshuffled in each iteration to form randomized subsets divided into n folds. This method improves model performance and achieves higher accuracy than the baseline model. The novelty lies in the data preparation...
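
The shuffled n-fold procedure the abstract describes can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names (`shuffled_kfold`, `cross_val_accuracy`, `fit_majority`, `predict_majority`), the `seed` parameter, and the toy data are all assumptions introduced here. The sketch shuffles once per call, so repeated reshuffling would mean calling it with different seeds.

```python
import random
from statistics import mean


def shuffled_kfold(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs after shuffling the sample indices."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one sample.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx
        start = stop


def cross_val_accuracy(fit, predict, X, y, n_folds=5, seed=0):
    """Return the mean held-out accuracy of a model over shuffled folds."""
    scores = []
    for train_idx, test_idx in shuffled_kfold(len(X), n_folds, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return mean(scores)


# Hypothetical stand-in model: always predicts the majority training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def predict_majority(model, x):
    return model


# Toy data for illustration only (not the heart disease dataset).
X = [[v] for v in range(10)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
print(cross_val_accuracy(fit_majority, predict_majority, X, y, n_folds=5))
```

Any of the four model families named in the abstract could take the place of the majority-label stand-in by supplying its own `fit` and `predict` callables. The averaged fold score is then the figure to compare against a single train/test baseline.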

Bibliographic Details
Main Authors: Yagyanath Rimal, Navneet Sharma, Siddhartha Paudel, Abeer Alsadoon, Madhav Parsad Koirala, Sumeet Gill
Format: Article
Language: English
Published: Nature Portfolio 2025-04-01
Series: Scientific Reports
Subjects: Machine learning; Cross-validation; Accuracy-precision; Learning curve; Health informatics
Online Access: https://doi.org/10.1038/s41598-025-93675-1
author Yagyanath Rimal
Navneet Sharma
Siddhartha Paudel
Abeer Alsadoon
Madhav Parsad Koirala
Sumeet Gill
collection DOAJ
description Abstract This primary research paper emphasizes cross-validation, where data samples are reshuffled in each iteration to form randomized subsets divided into n folds. This method improves model performance and achieves higher accuracy than the baseline model. The novelty lies in the data preparation process, where numerical features were imputed using the mean, categorical features were imputed using chi-square methods, and normalization was applied. This research study involves transforming the original datasets and comparative model analysis of four Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Random Forest (RF) cross-validation methodologies to heart disease open datasets. The objective is to easily identify the average accuracy of model predictions and subsequently make recommendations for model selection based on data preprocessing cross-validation model increased (5 to 14%) more than baseline model for best model selection. From comparing each model’s accuracy scores, it is found that the logistic regression and k-nearest neighbor models achieved the highest accuracy of 81% among the four models when single accuracy is a concern. However, the random forest model summary statistics attained an F1 score of 95%, precision (96%), and recall (97%), indicating the highest overall macro accuracy score. These findings can be further compared using learning curve validation. Conversely, the logistic regression model exhibited the lowest accuracy of 84% among the four machine learning models. However, this research does not cover hyperparameter optimization, which could potentially improve model performance.
format Article
id doaj-art-cdd117aeff6f481cbbc264917a3abb5d
institution OA Journals
issn 2045-2322
language English
publishDate 2025-04-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-cdd117aeff6f481cbbc264917a3abb5d
2025-08-20T02:27:53Z
eng
Nature Portfolio
Scientific Reports, ISSN 2045-2322
2025-04-01, volume 15, issue 1, pages 1-14
10.1038/s41598-025-93675-1
Yagyanath Rimal, IIS (Deemed to be University)
Navneet Sharma, IIS (Deemed to be University)
Siddhartha Paudel, IOE
Abeer Alsadoon, Western Sydney University (WSU)
Madhav Parsad Koirala, Pokhara University
Sumeet Gill, Maharshi Dayanand University
title Comparative analysis of heart disease prediction using logistic regression, SVM, KNN, and random forest with cross-validation for improved accuracy
topic Machine learning
Cross-validation
Accuracy-precision
Learning curve
Health informatics
url https://doi.org/10.1038/s41598-025-93675-1