Characterisation of cardiovascular disease (CVD) incidence and machine learning risk prediction in middle-aged and elderly populations: data from the China health and retirement longitudinal study (CHARLS)

Abstract. Background: Due to the ageing population and evolving lifestyles in China, middle-aged and elderly populations have become high-risk groups for cardiovascular disease (CVD). The aim of this study was to analyse the incidence characteristics of CVD in these populations and develop a...

Bibliographic Details
Main Authors: Qing Huang, Zihao Jiang, Bo Shi, Jiaxu Meng, Li Shu, Fuyong Hu, Jing Mi
Format: Article
Language: English
Published: BMC 2025-02-01
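Series: BMC Public Health
Subjects: Cardiovascular disease; Middle-aged and elderly individuals; Morbidity characteristics; Machine learning; Predictive modelling
Online Access: https://doi.org/10.1186/s12889-025-21609-7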
author Qing Huang
Zihao Jiang
Bo Shi
Jiaxu Meng
Li Shu
Fuyong Hu
Jing Mi
author_sort Qing Huang
collection DOAJ
description Abstract
Background: Owing to population ageing and changing lifestyles in China, middle-aged and elderly populations have become high-risk groups for cardiovascular disease (CVD). The aim of this study was to analyse the incidence characteristics of CVD in these populations and to develop a prediction model using data from the China Health and Retirement Longitudinal Study (CHARLS).
Methods: We used follow-up data from the CHARLS to analyse CVD incidence in the Chinese middle-aged and elderly population over a span of 9 years. Five machine learning (ML) algorithms were employed for risk prediction. Data preprocessing included missing-value imputation via random forest. Feature selection was performed prior to model training using the Least Absolute Shrinkage and Selection Operator with cross-validation (Lasso CV). The synthetic minority over-sampling technique (SMOTE) was applied to address class imbalance. Model performance was evaluated using the area under the ROC curve (AUC), precision, recall, and F1 score, with SHAP plots used for interpretability.
Results: After the exclusion criteria were applied, 12,580, 12,061, 11,545, and 11,619 participants were included in the four follow-up rounds. The cumulative incidence (CI) of CVD at 2, 4, 7, and 9 years was 2.846%, 8.971%, 17.869%, and 20.518%, respectively. Significant differences in CVD incidence were observed across gender, age, ethnicity, and region, with higher rates in females and in the northeast region. Ultimately, 8,080 participants and 24 features were analysed for CVD risk prediction, and five ML models were built on these features. Although the LGB model achieved an AUC of 0.818, indicating strong overall discrimination, its F1 score and recall were relatively low, at 0.509 and 43.1%, respectively. Shapley additive explanations (SHAP) analyses revealed the importance of key features, such as night sleep duration, triglyceride (TG) levels, and waist circumference, in predicting outcomes, and highlighted nonlinear relationships between these features and CVD risk.
Conclusions: Gender, age, ethnicity, and region are significant factors influencing CVD incidence. Although the LGB model demonstrated good overall performance, its low F1 score and recall reveal limitations in identifying individuals at high risk of cardiovascular disease.
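The gap between the reported AUC and the reported F1 score and recall can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the article: it uses the definition of F1 as the harmonic mean of precision and recall to back out the precision implied by the abstract's rounded figures (F1 = 0.509, recall = 43.1%). The function names are hypothetical, and because the inputs are rounded, the result is only an approximation of whatever precision the full paper reports.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, given F1 and recall R."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("F1 must be smaller than 2 * recall")
    return f1 * recall / denom


# Rounded values quoted in the abstract for the LGB model.
reported_f1 = 0.509
reported_recall = 0.431

precision = implied_precision(reported_f1, reported_recall)
missed_per_100 = 100 * (1 - reported_recall)

print(f"implied precision: {precision:.3f}")
print(f"true cases missed per 100: {missed_per_100:.1f}")
```

Under these assumptions the implied precision is roughly 0.62, while a recall of 43.1% means that about 57 of every 100 true CVD cases would go unflagged, which is the limitation the conclusions point to.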
format Article
id doaj-art-1dd22b63fb5f48e5a860d9a8ff687e80
institution Kabale University
issn 1471-2458
language English
publishDate 2025-02-01
publisher BMC
record_format Article
series BMC Public Health
spelling doaj-art-1dd22b63fb5f48e5a860d9a8ff687e80 2025-02-09T12:57:59Z eng BMC BMC Public Health 1471-2458 2025-02-01 25 1 1 12 10.1186/s12889-025-21609-7
Qing Huang (School of Public Health, Bengbu Medical University)
Zihao Jiang (School of Public Health, Bengbu Medical University)
Bo Shi (School of Medical Imaging, Bengbu Medical University)
Jiaxu Meng (School of Medical Imaging, Bengbu Medical University)
Li Shu (School of Public Health, Bengbu Medical University)
Fuyong Hu (School of Public Health, Bengbu Medical University)
Jing Mi (School of Public Health, Bengbu Medical University)
title Characterisation of cardiovascular disease (CVD) incidence and machine learning risk prediction in middle-aged and elderly populations: data from the China health and retirement longitudinal study (CHARLS)
topic Cardiovascular disease
Middle-aged and elderly individuals
Morbidity characteristics
Machine learning
Predictive modelling
url https://doi.org/10.1186/s12889-025-21609-7