Novel machine learning algorithm in risk prediction model for pan-cancer risk: application in a large prospective cohort

Objective To develop and validate machine-learning models that predict the risk of pan-cancer incidence using demographic, questionnaire and routine health check-up data in a large Asian population. Methods and analysis This study is a prospective cohort study including 433 549 participants from the...

Bibliographic Details
Main Authors: Chi-Pang Wen, Xifeng Wu, Qingfeng Hu, Huakang Tu, Shan Pou Tsai, David Ta-Wei Chu
Format: Article
Language: English
Published: BMJ Publishing Group 2024-07-01
Series: BMJ Oncology
Online Access: https://bmjoncology.bmj.com/content/3/1/e000087.full
author Chi-Pang Wen
Xifeng Wu
Qingfeng Hu
Huakang Tu
Shan Pou Tsai
David Ta-Wei Chu
collection DOAJ
description Objective To develop and validate machine-learning models that predict the risk of pan-cancer incidence using demographic, questionnaire and routine health check-up data in a large Asian population. Methods and analysis This study is a prospective cohort study including 433 549 participants from the prospective MJ cohort including a male cohort (n=208 599) and a female cohort (n=224 950). Results During an 8-year median follow-up, 5143 cancers occurred in males and 4764 in females. Compared with Lasso-Cox and Random Survival Forests, XGBoost showed superior performance for both cohorts. The XGBoost model with all 155 features in males and 160 features in females achieved an area under the curve (AUC) of 0.877 and 0.750, respectively. Light models with 31 variables for males and 11 variables for females showed comparable performance: an AUC of 0.876 (95% CI 0.858 to 0.894) in the overall population and 0.818 (95% CI 0.795 to 0.841) in those aged ≥40 years in the male cohort and an AUC of 0.746 (95% CI 0.721 to 0.771) in the overall population and 0.641 (95% CI 0.605 to 0.677) in those aged ≥40 years in the female cohort. High-risk individuals have at least ninefold higher risk of pan-cancer incidence compared with low-risk groups. Conclusion We developed and internally validated the first machine-learning models based on routine health check-up data to predict pan-cancer risk in the general population and achieved generally good discriminatory ability with a small set of predictors. External validation is warranted before the implementation of our risk model in clinical practice.
format Article
id doaj-art-8c4a2105e3c34b53be975b89bafdb473
institution Kabale University
issn 2752-7948
language English
publishDate 2024-07-01
publisher BMJ Publishing Group
record_format Article
series BMJ Oncology
spelling doaj-art-8c4a2105e3c34b53be975b89bafdb473
2025-01-30T09:35:14Z
eng
BMJ Publishing Group
BMJ Oncology
2752-7948
2024-07-01
3
1
10.1136/bmjonc-2023-000087
Novel machine learning algorithm in risk prediction model for pan-cancer risk: application in a large prospective cohort
Chi-Pang Wen (0)
Xifeng Wu (1)
Qingfeng Hu (2)
Huakang Tu (3)
Shan Pou Tsai (4)
David Ta-Wei Chu (5)
(0) National Institute for Data Science in Health and Medicine, Zhejiang University, Hangzhou, Zhejiang, China
(1) School of Public Health, Zhejiang Medical University, Hangzhou, China
(2) Department of Big Data in Health Science School of Public Health, and Center of Clinical Big Data and Analytics of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
(3) Department of Big Data in Health Science School of Public Health, and Center of Clinical Big Data and Analytics of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
(4) MJ Health Research Foundation, Taipei, Taiwan
(5) MJ Health Management Center, Taipei, Taiwan
https://bmjoncology.bmj.com/content/3/1/e000087.full
title Novel machine learning algorithm in risk prediction model for pan-cancer risk: application in a large prospective cohort
url https://bmjoncology.bmj.com/content/3/1/e000087.full