Comparative analysis of heart disease prediction using logistic regression, SVM, KNN, and random forest with cross-validation for improved accuracy

Abstract This primary research paper emphasizes cross-validation, where data samples are reshuffled in each iteration to form randomized subsets divided into n folds. This method improves model performance and achieves higher accuracy than the baseline model. The novelty lies in the data preparation...
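
The shuffled n-fold procedure the abstract describes can be sketched as follows. This is a minimal standard-library Python illustration, not the authors' code: the names `shuffled_kfold_indices` and `cross_validated_accuracy`, the `seed` parameter, and the `fit_predict` callback are all hypothetical, and one shuffle per call is assumed (calling again with a different seed gives a fresh shuffle for the next iteration).

```python
import random


def shuffled_kfold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs after one random shuffle of the indices."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumption: seeded shuffle for repeatability
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        start += size
        yield train_idx, test_idx


def cross_validated_accuracy(X, y, fit_predict, n_folds=5, seed=0):
    """Average the per-fold accuracy of a hypothetical fit_predict(X_train, y_train, X_test)."""
    scores = []
    for train_idx, test_idx in shuffled_kfold_indices(len(X), n_folds, seed):
        preds = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)
```

Any model can be plugged in through `fit_predict`; averaging the fold scores gives the "average accuracy" figure that a comparison of this kind would rank models by.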

Bibliographic Details
Main Authors:	Yagyanath Rimal, Navneet Sharma, Siddhartha Paudel, Abeer Alsadoon, Madhav Parsad Koirala, Sumeet Gill
Format:	Article
Language:	English
Published:	Nature Portfolio 2025-04-01
Series:	Scientific Reports
Subjects:	Machine learning Cross-validation Accuracy-precision Learning curve Health informatics
Online Access:	https://doi.org/10.1038/s41598-025-93675-1

author	Yagyanath Rimal Navneet Sharma Siddhartha Paudel Abeer Alsadoon Madhav Parsad Koirala Sumeet Gill
collection	DOAJ
description	Abstract This primary research paper emphasizes cross-validation, in which data samples are reshuffled in each iteration to form randomized subsets divided into n folds. This method improves model performance and achieves higher accuracy than the baseline model. The novelty lies in the data preparation process: numerical features were imputed using the mean, categorical features were imputed using chi-square methods, and normalization was applied. The study transforms the original datasets and compares four models, Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Random Forest (RF), under cross-validation on open heart disease datasets. The objective is to identify the average accuracy of model predictions and then recommend a model; with data preprocessing and cross-validation, accuracy increased by 5 to 14% over the baseline model. Comparing each model's accuracy scores, the logistic regression and k-nearest neighbor models achieved the highest accuracy of 81% among the four models when single accuracy is the concern. However, the random forest model's summary statistics attained an F1 score of 95%, precision of 96%, and recall of 97%, indicating the highest overall macro accuracy score. These findings can be further compared using learning curve validation. Conversely, the logistic regression model exhibited the lowest accuracy of 84% among the four machine learning models. This research does not cover hyperparameter optimization, which could potentially improve model performance.
format	Article
id	doaj-art-cdd117aeff6f481cbbc264917a3abb5d
institution	OA Journals
issn	2045-2322
language	English
publishDate	2025-04-01
publisher	Nature Portfolio
record_format	Article
series	Scientific Reports
spelling	doaj-art-cdd117aeff6f481cbbc264917a3abb5d | 2025-08-20T02:27:53Z | eng | Nature Portfolio | Scientific Reports | 2045-2322 | 2025-04-01 | 151114 | 10.1038/s41598-025-93675-1 | Yagyanath Rimal (IIS (Deemed to be University)); Navneet Sharma (IIS (Deemed to be University)); Siddhartha Paudel (IOE); Abeer Alsadoon (Western Sydney University (WSU)); Madhav Parsad Koirala (Pokhara University); Sumeet Gill (Maharshi Dayanand University) | https://doi.org/10.1038/s41598-025-93675-1 | Machine learning; Cross-validation; Accuracy-precision; Learning curve; Health informatics
title	Comparative analysis of heart disease prediction using logistic regression, SVM, KNN, and random forest with cross-validation for improved accuracy
topic	Machine learning Cross-validation Accuracy-precision Learning curve Health informatics
url	https://doi.org/10.1038/s41598-025-93675-1

Comparative analysis of heart disease prediction using logistic regression, SVM, KNN, and random forest with cross-validation for improved accuracy
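
The abstract reports macro-level precision, recall, and F1 for the random forest model. As a sketch of how such macro averages are commonly computed (per-class scores averaged with equal weight), assuming plain Python lists of labels: the function name `macro_scores` is hypothetical, and this record does not state which tooling the paper itself used.

```python
def macro_scores(y_true, y_pred):
    """Return macro-averaged precision, recall, and F1 over all observed labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Assumption: undefined ratios (zero denominator) are scored as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

Because each class counts equally, macro averages can differ noticeably from plain accuracy on imbalanced data, which is one reason a model can rank differently on accuracy and on macro F1.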

Similar Items