Characterisation of cardiovascular disease (CVD) incidence and machine learning risk prediction in middle-aged and elderly populations: data from the China health and retirement longitudinal study (CHARLS)
Abstract Background Due to the ageing population and evolving lifestyles occurring in China, middle-aged and elderly populations have become high-risk groups for cardiovascular disease (CVD). The aim of this study was to analyse the incidence characteristics of CVD in these populations and develop a...
Saved in:
Main Authors: | , , , , , , |
---|---|
Format: | Article |
Language: | English |
Published: |
BMC
2025-02-01
|
Series: | BMC Public Health |
Subjects: | |
Online Access: | https://doi.org/10.1186/s12889-025-21609-7 |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | Abstract Background Due to the ageing population and evolving lifestyles occurring in China, middle-aged and elderly populations have become high-risk groups for cardiovascular disease (CVD). The aim of this study was to analyse the incidence characteristics of CVD in these populations and develop a prediction model by using data from the China Health and Retirement Longitudinal Study (CHARLS). Methods We used follow-up data from the CHARLS to analyse CVD incidence in the Chinese middle-aged and elderly population over a time span of 9 years. Five machine learning (ML) algorithms were employed for risk prediction. Data preprocessing included missing value imputation via random forest. Feature selection was performed using the Least Absolute Shrinkage and Selection Operator (Lasso CV) method with cross-validation prior to model training. The application of the synthetic minority over-sampling technique (SMOTE) to address class imbalance. Model performance was evaluated via analyses including the area under the ROC curve (AUC), precision, recall, F1 score, and SHAP plots for interpretability. Results In accordance with the exclusion criteria, 12,580, 12,061, 11,545, and 11,619 participants were enrolled in four follow-up rounds. The cumulative incidence (CI) of CVD at 2, 4, 7, and 9 years was 2.846%, 8.971%, 17.869% and 20.518%,, respectively. Significant differences in CVD incidence were observed across gender, age, ethnicity, and region, with higher rates observed in females and in the northeast region. Ultimately, 8,080 participants and 24 features were analysed for CVD risk prediction. Five ML models were built based on these features. Although the LGB model achieves an AUC of 0.818, indicating strong overall performance, its F1 score and recall rate are relatively low, at 0.509 and 43.1%, respectively. Shapley additive explanations (SHAP) analyses revealed the importance of key features, such as night sleep duration, TG levels, and waist circumference, in predicting outcomes, and highlighted the nonlinear relationships between these features and CVD risk. Conclusions Gender, age, ethnicity, and region are significant factors influencing CVD incidence. Although the LGB model demonstrates good overall performance, its low F1 score and recall rate reveal limitations in identifying high-risk cardiovascular disease patients. |
---|---|
ISSN: | 1471-2458 |