Predicting Early-Onset Colorectal Cancer in Individuals Below Screening Age Using Machine Learning and Real-World Data: Case Control Study

Abstract Background: Colorectal cancer is now the leading cause of cancer-related deaths among young Americans. Accurate early prediction and a thorough understanding of the risk factors for early-onset colorectal cancer (EOCRC) are vital for effective prevention and treatment,...

Bibliographic Details
Main Authors: Chengkun Sun, Erin Mobley, Michael Quillen, Max Parker, Meghan Daly, Rui Wang, Isabela Visintin, Ziad Awad, Jennifer Fishe, Alexander Parker, Thomas George, Jiang Bian, Jie Xu
Format: Article
Language: English
Published: JMIR Publications 2025-06-01
Series: JMIR Cancer
Online Access: https://cancer.jmir.org/2025/1/e64506
collection DOAJ
description Abstract
Background: Colorectal cancer is now the leading cause of cancer-related deaths among young Americans. Accurate early prediction and a thorough understanding of the risk factors for early-onset colorectal cancer (EOCRC) are vital for effective prevention and treatment, particularly for patients below the recommended screening age.
Objective: Our study aims to predict EOCRC using machine learning (ML) and structured electronic health record data for individuals under the screening age of 45 years, with the aim of exploring potential risk and protective factors that could support early diagnosis.
Methods: We identified a cohort of patients under the age of 45 years from the OneFlorida+ Clinical Research Consortium. Given the distinct pathology of colon cancer (CC) and rectal cancer (RC), we created separate prediction models for each cancer type with various ML algorithms. We assessed multiple prediction time windows (ie, 0, 1, 3, and 5 y) and ensured robustness through propensity score matching to account for confounding variables including sex, race, ethnicity, and birth year. We conducted a comprehensive performance evaluation using metrics including area under the curve (AUC), sensitivity, specificity, positive predictive value, negative predictive value, and F1 score.
Results: The final cohort included 1358 CC cases with 6790 matched controls, and 560 RC cases with 2800 matched controls. The RC group had a more balanced sex distribution (2:3 male-to-female) compared to the CC group (2:5 male-to-female), and both groups showed diverse racial and ethnic representation. Our predictive models demonstrated reasonable results, with AUC scores for CC prediction of 0.811, 0.748, 0.689, and 0.686 at 0, 1, 3, and 5 years before diagnosis, respectively. For RC prediction, AUC scores were 0.829, 0.771, 0.727, and 0.721 across the same time windows. Key predictive features across both cancer types included immune and digestive system disorders, secondary malignancies, and underweight status. In addition, blood diseases emerged as prominent indicators specifically for CC.
Conclusions: Our findings demonstrate the potential of ML models leveraging electronic health record data to facilitate the early prediction of EOCRC in individuals under 45 years. By uncovering important risk factors and achieving promising predictive performance, this study provides preliminary insights that could inform future efforts toward earlier detection and prevention in younger populations.
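The abstract names the evaluation metrics (AUC, sensitivity, specificity, positive predictive value, negative predictive value, and F1) without defining them. The sketch below shows one common way to compute them for a binary case/control classifier. It is an illustration only, not the authors' code: the function names `binary_metrics` and `auc`, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for this example and are not taken from the study.

```python
from typing import Dict, List


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Confusion-matrix metrics for 0/1 labels (1 = case) and 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cases flagged as cases
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls flagged as controls
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # controls flagged as cases
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cases missed
    sensitivity = _ratio(tp, tp + fn)
    ppv = _ratio(tp, tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": _ratio(tn, tn + fp),
        "ppv": ppv,
        "npv": _ratio(tn, tn + fn),
        # F1 is the harmonic mean of PPV (precision) and sensitivity (recall).
        "f1": _ratio(2 * ppv * sensitivity, ppv + sensitivity),
    }


def auc(y_true: List[int], scores: List[float]) -> float:
    """AUC as the probability that a random case outscores a random control.

    Ties count as half a win (the Mann-Whitney formulation). This pairwise
    loop is quadratic and meant for small illustrative inputs only.
    """
    cases = [s for t, s in zip(y_true, scores) if t == 1]
    controls = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return _ratio(wins, len(cases) * len(controls))


# Hypothetical toy data: 3 cases and 5 controls with made-up risk scores.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
toy_preds = [1 if s >= 0.5 else 0 for s in toy_scores]  # assumed 0.5 threshold

print(binary_metrics(toy_labels, toy_preds))
print(auc(toy_labels, toy_scores))
```

AUC is computed from the raw scores and does not depend on a threshold, whereas the other five metrics change with the chosen cutoff. Positive and negative predictive values also depend on the case-to-control ratio of the evaluated sample; the cohort sizes reported above correspond to five matched controls per case (6790/1358 = 5 and 2800/560 = 5).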
format Article
id doaj-art-2b48754b76e540f1bf7b855604cefc33
institution Kabale University
issn 2369-1999
language English
publishDate 2025-06-01
publisher JMIR Publications
record_format Article
series JMIR Cancer
spelling doaj-art-2b48754b76e540f1bf7b855604cefc33 2025-08-20T03:27:58Z eng JMIR Publications JMIR Cancer 2369-1999 2025-06-01 11 e64506 10.2196/64506
Chengkun Sun http://orcid.org/0000-0003-2095-9369
Erin Mobley http://orcid.org/0000-0002-6278-6593
Michael Quillen http://orcid.org/0009-0006-6657-1748
Max Parker http://orcid.org/0009-0006-7873-9480
Meghan Daly http://orcid.org/0009-0002-3445-250X
Rui Wang http://orcid.org/0000-0002-8320-6500
Isabela Visintin http://orcid.org/0009-0004-7991-2391
Ziad Awad http://orcid.org/0000-0002-6555-6240
Jennifer Fishe http://orcid.org/0000-0001-9037-8143
Alexander Parker http://orcid.org/0000-0001-6820-274X
Thomas George http://orcid.org/0000-0002-6249-9180
Jiang Bian http://orcid.org/0000-0002-2238-5429
Jie Xu http://orcid.org/0000-0001-5291-5198
url https://cancer.jmir.org/2025/1/e64506