Novel machine learning algorithm in risk prediction model for pan-cancer risk: application in a large prospective cohort
Objective To develop and validate machine-learning models that predict the risk of pan-cancer incidence using demographic, questionnaire and routine health check-up data in a large Asian population. Methods and analysis This study is a prospective cohort study including 433 549 participants from the...
Main Authors: | Chi-Pang Wen, Xifeng Wu, Qingfeng Hu, Huakang Tu, Shan Pou Tsai, David Ta-Wei Chu |
---|---|
Format: | Article |
Language: | English |
Published: | BMJ Publishing Group, 2024-07-01 |
Series: | BMJ Oncology |
Online Access: | https://bmjoncology.bmj.com/content/3/1/e000087.full |
_version_ | 1832581907792003072 |
---|---|
author | Chi-Pang Wen Xifeng Wu Qingfeng Hu Huakang Tu Shan Pou Tsai David Ta-Wei Chu |
author_facet | Chi-Pang Wen Xifeng Wu Qingfeng Hu Huakang Tu Shan Pou Tsai David Ta-Wei Chu |
author_sort | Chi-Pang Wen |
collection | DOAJ |
description | Objective To develop and validate machine-learning models that predict the risk of pan-cancer incidence using demographic, questionnaire and routine health check-up data in a large Asian population. Methods and analysis This study is a prospective cohort study including 433 549 participants from the prospective MJ cohort including a male cohort (n=208 599) and a female cohort (n=224 950). Results During an 8-year median follow-up, 5143 cancers occurred in males and 4764 in females. Compared with Lasso-Cox and Random Survival Forests, XGBoost showed superior performance for both cohorts. The XGBoost model with all 155 features in males and 160 features in females achieved an area under the curve (AUC) of 0.877 and 0.750, respectively. Light models with 31 variables for males and 11 variables for females showed comparable performance: an AUC of 0.876 (95% CI 0.858 to 0.894) in the overall population and 0.818 (95% CI 0.795 to 0.841) in those aged ≥40 years in the male cohort and an AUC of 0.746 (95% CI 0.721 to 0.771) in the overall population and 0.641 (95% CI 0.605 to 0.677) in those aged ≥40 years in the female cohort. High-risk individuals have at least ninefold higher risk of pan-cancer incidence compared with low-risk groups. Conclusion We developed and internally validated the first machine-learning models based on routine health check-up data to predict pan-cancer risk in the general population and achieved generally good discriminatory ability with a small set of predictors. External validation is warranted before the implementation of our risk model in clinical practice. |
format | Article |
id | doaj-art-8c4a2105e3c34b53be975b89bafdb473 |
institution | Kabale University |
issn | 2752-7948 |
language | English |
publishDate | 2024-07-01 |
publisher | BMJ Publishing Group |
record_format | Article |
series | BMJ Oncology |
spelling | doaj-art-8c4a2105e3c34b53be975b89bafdb473; 2025-01-30T09:35:14Z; eng; BMJ Publishing Group; BMJ Oncology; 2752-7948; 2024-07-01; 3; 1; 10.1136/bmjonc-2023-000087; Novel machine learning algorithm in risk prediction model for pan-cancer risk: application in a large prospective cohort; Chi-Pang Wen (0); Xifeng Wu (1); Qingfeng Hu (2); Huakang Tu (3); Shan Pou Tsai (4); David Ta-Wei Chu (5); National Institute for Data Science in Health and Medicine, Zhejiang University, Hangzhou, Zhejiang, China; School of Public Health, Zhejiang Medical University, Hangzhou, China; Department of Big Data in Health Science School of Public Health, and Center of Clinical Big Data and Analytics of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China; Department of Big Data in Health Science School of Public Health, and Center of Clinical Big Data and Analytics of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China; MJ Health Research Foundation, Taipei, Taiwan; MJ Health Management Center, Taipei, Taiwan; Objective To develop and validate machine-learning models that predict the risk of pan-cancer incidence using demographic, questionnaire and routine health check-up data in a large Asian population. Methods and analysis This study is a prospective cohort study including 433 549 participants from the prospective MJ cohort including a male cohort (n=208 599) and a female cohort (n=224 950). Results During an 8-year median follow-up, 5143 cancers occurred in males and 4764 in females. Compared with Lasso-Cox and Random Survival Forests, XGBoost showed superior performance for both cohorts. The XGBoost model with all 155 features in males and 160 features in females achieved an area under the curve (AUC) of 0.877 and 0.750, respectively. Light models with 31 variables for males and 11 variables for females showed comparable performance: an AUC of 0.876 (95% CI 0.858 to 0.894) in the overall population and 0.818 (95% CI 0.795 to 0.841) in those aged ≥40 years in the male cohort and an AUC of 0.746 (95% CI 0.721 to 0.771) in the overall population and 0.641 (95% CI 0.605 to 0.677) in those aged ≥40 years in the female cohort. High-risk individuals have at least ninefold higher risk of pan-cancer incidence compared with low-risk groups. Conclusion We developed and internally validated the first machine-learning models based on routine health check-up data to predict pan-cancer risk in the general population and achieved generally good discriminatory ability with a small set of predictors. External validation is warranted before the implementation of our risk model in clinical practice.; https://bmjoncology.bmj.com/content/3/1/e000087.full |
spellingShingle | Chi-Pang Wen Xifeng Wu Qingfeng Hu Huakang Tu Shan Pou Tsai David Ta-Wei Chu Novel machine learning algorithm in risk prediction model for pan-cancer risk: application in a large prospective cohort BMJ Oncology |
title | Novel machine learning algorithm in risk prediction model for pan-cancer risk: application in a large prospective cohort |
title_full | Novel machine learning algorithm in risk prediction model for pan-cancer risk: application in a large prospective cohort |
title_fullStr | Novel machine learning algorithm in risk prediction model for pan-cancer risk: application in a large prospective cohort |
title_full_unstemmed | Novel machine learning algorithm in risk prediction model for pan-cancer risk: application in a large prospective cohort |
title_short | Novel machine learning algorithm in risk prediction model for pan-cancer risk: application in a large prospective cohort |
title_sort | novel machine learning algorithm in risk prediction model for pan cancer risk application in a large prospective cohort |
url | https://bmjoncology.bmj.com/content/3/1/e000087.full |
work_keys_str_mv | AT chipangwen novelmachinelearningalgorithminriskpredictionmodelforpancancerriskapplicationinalargeprospectivecohort AT xifengwu novelmachinelearningalgorithminriskpredictionmodelforpancancerriskapplicationinalargeprospectivecohort AT qingfenghu novelmachinelearningalgorithminriskpredictionmodelforpancancerriskapplicationinalargeprospectivecohort AT huakangtu novelmachinelearningalgorithminriskpredictionmodelforpancancerriskapplicationinalargeprospectivecohort AT shanpoutsai novelmachinelearningalgorithminriskpredictionmodelforpancancerriskapplicationinalargeprospectivecohort AT davidtaweichu novelmachinelearningalgorithminriskpredictionmodelforpancancerriskapplicationinalargeprospectivecohort |