Comparative analysis of heart disease prediction using logistic regression, SVM, KNN, and random forest with cross-validation for improved accuracy
Abstract This primary research paper emphasizes cross-validation, where data samples are reshuffled in each iteration to form randomized subsets divided into n folds. This method improves model performance and achieves higher accuracy than the baseline model. The novelty lies in the data preparation...
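The reshuffled n-fold splitting described in the abstract can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the function names, the `seed` and `n_repeats` parameters, and the `majority_class` baseline are all hypothetical, and the record does not say which library or settings the paper used.

```python
import random


def shuffled_k_fold(n_samples, n_folds, seed):
    """Yield (train_idx, test_idx) pairs after shuffling the sample indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validated_accuracy(X, y, fit_predict, n_folds=5, n_repeats=3):
    """Average accuracy over n_repeats reshuffled n_folds-fold splits."""
    scores = []
    for repeat in range(n_repeats):
        # A different seed per repeat reshuffles the data in each iteration.
        for train_idx, test_idx in shuffled_k_fold(len(X), n_folds, seed=repeat):
            preds = fit_predict(
                [X[i] for i in train_idx],
                [y[i] for i in train_idx],
                [X[i] for i in test_idx],
            )
            correct = sum(p == y[i] for p, i in zip(preds, test_idx))
            scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


def majority_class(X_train, y_train, X_test):
    """Hypothetical baseline: always predict the most common training label."""
    most_common = max(set(y_train), key=y_train.count)
    return [most_common] * len(X_test)
```

Any classifier wrapped as a `fit_predict(X_train, y_train, X_test)` callable could be passed in place of `majority_class`, which would allow several models to be compared by their average fold accuracy.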
| Main Authors: | Yagyanath Rimal, Navneet Sharma, Siddhartha Paudel, Abeer Alsadoon, Madhav Parsad Koirala, Sumeet Gill |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-04-01 |
| Series: | Scientific Reports |
| Subjects: | Machine learning; Cross-validation; Accuracy-precision; Learning curve; Health informatics |
| Online Access: | https://doi.org/10.1038/s41598-025-93675-1 |
| _version_ | 1850146324775698432 |
|---|---|
| author | Yagyanath Rimal Navneet Sharma Siddhartha Paudel Abeer Alsadoon Madhav Parsad Koirala Sumeet Gill |
| author_sort | Yagyanath Rimal |
| collection | DOAJ |
| description | Abstract This primary research paper emphasizes cross-validation, in which data samples are reshuffled in each iteration to form randomized subsets divided into n folds. This method improves model performance and achieves higher accuracy than the baseline model. The novelty lies in the data preparation process: numerical features were imputed using the mean, categorical features were imputed using chi-square methods, and normalization was applied. The study transforms the original datasets and applies cross-validation to compare four models, Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Random Forest (RF), on open heart disease datasets. The objective is to identify the average accuracy of model predictions and to recommend a model on that basis; with data preprocessing, the cross-validated models improved on the baseline model by 5 to 14%. Comparing each model's accuracy scores, the logistic regression and k-nearest neighbor models achieved the highest accuracy of 81% among the four models when accuracy alone is the concern. However, the random forest model attained an F1 score of 95%, precision of 96%, and recall of 97%, indicating the highest overall macro-averaged score. These findings can be further compared using learning curve validation. Conversely, the logistic regression model exhibited the lowest accuracy of 84% among the four machine learning models. This research does not cover hyperparameter optimization, which could potentially improve model performance. |
| format | Article |
| id | doaj-art-cdd117aeff6f481cbbc264917a3abb5d |
| institution | OA Journals |
| issn | 2045-2322 |
| language | English |
| publishDate | 2025-04-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Scientific Reports |
| spelling | doaj-art-cdd117aeff6f481cbbc264917a3abb5d; 2025-08-20T02:27:53Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2025-04-01; 15; 1; 1; 14; 10.1038/s41598-025-93675-1; Comparative analysis of heart disease prediction using logistic regression, SVM, KNN, and random forest with cross-validation for improved accuracy; Yagyanath Rimal, IIS (Deemed to be University); Navneet Sharma, IIS (Deemed to be University); Siddhartha Paudel, IOE; Abeer Alsadoon, Western Sydney University (WSU); Madhav Parsad Koirala, Pokhara University; Sumeet Gill, Maharshi Dayanand University; https://doi.org/10.1038/s41598-025-93675-1; Machine learning; Cross-validation; Accuracy-precision; Learning curve; Health informatics |
| title | Comparative analysis of heart disease prediction using logistic regression, SVM, KNN, and random forest with cross-validation for improved accuracy |
| title_sort | comparative analysis of heart disease prediction using logistic regression svm knn and random forest with cross validation for improved accuracy |
| topic | Machine learning Cross-validation Accuracy-precision Learning curve Health informatics |
| url | https://doi.org/10.1038/s41598-025-93675-1 |
| work_keys_str_mv | AT yagyanathrimal comparativeanalysisofheartdiseasepredictionusinglogisticregressionsvmknnandrandomforestwithcrossvalidationforimprovedaccuracy AT navneetsharma comparativeanalysisofheartdiseasepredictionusinglogisticregressionsvmknnandrandomforestwithcrossvalidationforimprovedaccuracy AT siddharthapaudel comparativeanalysisofheartdiseasepredictionusinglogisticregressionsvmknnandrandomforestwithcrossvalidationforimprovedaccuracy AT abeeralsadoon comparativeanalysisofheartdiseasepredictionusinglogisticregressionsvmknnandrandomforestwithcrossvalidationforimprovedaccuracy AT madhavparsadkoirala comparativeanalysisofheartdiseasepredictionusinglogisticregressionsvmknnandrandomforestwithcrossvalidationforimprovedaccuracy AT sumeetgill comparativeanalysisofheartdiseasepredictionusinglogisticregressionsvmknnandrandomforestwithcrossvalidationforimprovedaccuracy |
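The description field above reports precision, recall, and F1 alongside accuracy. As a rough sketch of how such summary scores relate to per-class counts, the following assumes that "macro" means an unweighted average over classes; the record does not confirm which averaging the paper used, and the function name `macro_scores` is hypothetical.

```python
def macro_scores(y_true, y_pred):
    """Return (precision, recall, f1), each averaged with equal weight per class."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Define a score as 0.0 when its denominator is empty.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Because each class contributes equally to a macro average, these scores can diverge from plain accuracy when the classes are imbalanced, which is one reason a model ranking by accuracy and a ranking by F1 need not agree.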