Machine learning algorithms to predict stroke in China based on causal inference of time series analysis

Abstract. Importance: Identifying and managing high-risk populations for stroke in a targeted manner is a key area of preventive healthcare. Objective: To assess machine learning (ML) models combined with causal inference from time series analysis as a clinically meaningful approach to predicting stroke. Design: This is a...

Bibliographic Details
Main Authors: Qizhi Zheng, Ayang Zhao, Xinzhu Wang, Yanhong Bai, Zikun Wang, Xiuying Wang, Xianzhang Zeng, Guanghui Dong
Format: Article
Language: English
Published: BMC 2025-05-01
Series: BMC Neurology
Subjects: Machine learning; Dynamic causal inference; Stroke risk prediction; Gradient boosting
Online Access: https://doi.org/10.1186/s12883-025-04261-x
author Qizhi Zheng
Ayang Zhao
Xinzhu Wang
Yanhong Bai
Zikun Wang
Xiuying Wang
Xianzhang Zeng
Guanghui Dong
collection DOAJ
description Abstract. Importance: Identifying and managing high-risk populations for stroke in a targeted manner is a key area of preventive healthcare. Objective: To assess machine learning (ML) models combined with causal inference from time series analysis as a clinically meaningful approach to predicting stroke. Design: This retrospective cohort study used data from the China Health and Retirement Longitudinal Study (CHARLS), which assessed 11,789 adults in China from 2011 to 2018. Data analysis was performed from June 1 to December 1, 2024. Setting: CHARLS uses a multi-stage probability sampling method covering 28 provinces and collects data every two years through computer-assisted personal interviews (CAPI). Participants: This study combined a Vector Autoregression (VAR) model with Graph Neural Networks (GNN) to systematically construct dynamic causal inference features. Multiple classic classification algorithms were compared: Random Forest, Logistic Regression, XGBoost, Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Gradient Boosting, and Multi-Layer Perceptron (MLP). The Synthetic Minority Oversampling Technique (SMOTE) was used to oversample the minority class, and models were evaluated with stratified K-fold cross-validation. Main Outcome(s) and Measure(s): AUC (area under the ROC curve), accuracy, precision, recall, F1 score, and Matthews Correlation Coefficient (MCC). Results: This study included 11,789 participants: 6,334 females (53.73%) and 5,455 males (46.27%), with an average age of 65 years. Introducing dynamic causal inference features significantly improved the performance of almost all models. The area under the ROC curve of the models ranged from 0.78 to 0.83, with statistically significant differences (P < 0.01). Among all the models, Gradient Boosting demonstrated the highest performance and stability. Model explanation and feature importance analysis identified significant contributors to stroke risk. Conclusions and Relevance: This study proposes a stroke risk prediction method that combines dynamic causal inference with machine learning models, significantly improving prediction accuracy and revealing key health factors that affect stroke. The results indicate that dynamic causal inference features are valuable for predicting stroke risk, especially for capturing how changes in health status over time affect that risk. With further model optimization and additional variables, this approach offers a theoretical basis and practical guidance for future stroke prevention and intervention strategies. Trial registration: IRB00001052-11015.1.2.
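Most of the measures named in the abstract (accuracy, precision, recall, F1 score, MCC) follow directly from the four confusion-matrix counts of a binary classifier. The sketch below shows one way to compute them. The function name and the example counts are hypothetical and are not taken from the study, and AUC is left out because it needs predicted scores rather than counts.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute threshold-based metrics from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives. Any ratio with a zero denominator is reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # MCC uses all four cells, so it stays informative on imbalanced data.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts for an imbalanced outcome (60 events in 1,000 people).
m = classification_metrics(tp=40, fp=10, tn=930, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With a rare outcome such as stroke, accuracy can look high even when a third of events are missed, which is one reason to report recall, F1 score, and MCC alongside it.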
format Article
id doaj-art-fa6cf6059af14d1a89daab6b0e7fcdf2
institution DOAJ
issn 1471-2377
language English
publishDate 2025-05-01
publisher BMC
record_format Article
series BMC Neurology
spelling doaj-art-fa6cf6059af14d1a89daab6b0e7fcdf2 2025-08-20T03:16:40Z eng BMC BMC Neurology 1471-2377 2025-05-01 25 1 1 12 10.1186/s12883-025-04261-x
Machine learning algorithms to predict stroke in China based on causal inference of time series analysis
Qizhi Zheng (0), Ayang Zhao (1), Xinzhu Wang (2), Yanhong Bai (3), Zikun Wang (4), Xiuying Wang (5), Xianzhang Zeng (6), Guanghui Dong (7)
0 College of Computer and Control Engineering, Northeast Forestry University
1 School of Medicine and Health, Key Laboratory of Micro-systems and Micro-structures Manufacturing (Ministry of Education), Harbin Institute of Technology
2 College of Computer and Control Engineering, Northeast Forestry University
3 College of Computer and Control Engineering, Northeast Forestry University
4 College of Computer and Control Engineering, Northeast Forestry University
5 College of Computer and Control Engineering, Northeast Forestry University
6 Department of Anesthesiology, Chongqing University Cancer Hospital
7 College of Computer and Control Engineering, Northeast Forestry University
https://doi.org/10.1186/s12883-025-04261-x
Machine learning; Dynamic causal inference; Stroke risk prediction; Gradient boosting
title Machine learning algorithms to predict stroke in China based on causal inference of time series analysis
topic Machine learning
Dynamic causal inference
Stroke risk prediction
Gradient boosting
url https://doi.org/10.1186/s12883-025-04261-x