Pan-cancer predictive survival model development and evaluation using electronic health record and genetic data across 10 cancer types

Abstract The growing burden of cancer and recent surge in healthcare data availability call for new ways of analysing this multifactorial disease and improving patient outcomes. The aim of this study is to develop and evaluate prognostic cancer survival models across ten common cancer types based on...

Bibliographic Details
Main Authors: Jurgita Gammall, Alvina G. Lai
Format: Article
Language: English
Published: Springer 2025-05-01
Series: Discover Oncology
Subjects: Cancer, Prognosis, Survival, Predictive model, Machine learning, Genetics
Online Access: https://doi.org/10.1007/s12672-025-02523-1
author Jurgita Gammall
Alvina G. Lai
collection DOAJ
description Abstract The growing burden of cancer and the recent surge in healthcare data availability call for new ways of analysing this multifactorial disease and improving patient outcomes. The aim of this study is to develop and evaluate prognostic cancer survival models across ten common cancer types based on a large patient sample. We compare the performance of different machine learning algorithms and assess the added value of genetic information in cancer prognosis. We also provide ways to improve model explainability, which is critical for model adoption in clinical practice. This study included data from 9977 patients with bladder, breast, colorectal, endometrial, glioma, leukaemia, lung, ovarian, prostate, and renal cancers. Genetic data collected through the 100,000 Genomes Project was linked with clinical and demographic data provided by the National Cancer Registration and Analysis Service, Hospital Episode Statistics and Office for National Statistics. More than 500 prognostic features were assessed and four machine learning algorithms were developed: Elastic Net Cox proportional hazards regression, random survival forest, gradient boosting survival and the DeepSurv neural network. Most models achieved good performance, with the C-index ranging from 60% in bladder cancer to 80% in glioma and averaging 72% across all cancer types. The different machine learning methods achieved similar performance, with the DeepSurv model slightly underperforming the others. The addition of genetic data improved performance in endometrial, glioma, ovarian and prostate cancers, showing its potential importance for cancer prognosis. Patient age, stage, grade, referral route, waiting times, pre-existing conditions, previous hospital utilisation, tumour mutational burden and mutations in the TP53 gene were among the most important features in cancer survival modelling. By offering a comprehensive set of predictive models for cancer survival, this study fills a critical gap in our understanding of cancer prognosis and provides new tools for informing cancer treatment and consequently improving patient outcomes.
format Article
id doaj-art-25efcdd28fdd44448a256b79d693b710
institution DOAJ
issn 2730-6011
language English
publishDate 2025-05-01
publisher Springer
record_format Article
series Discover Oncology
spelling doaj-art-25efcdd28fdd44448a256b79d693b710; 2025-08-20T03:10:32Z; eng; Springer; Discover Oncology; 2730-6011; 2025-05-01; 161120; 10.1007/s12672-025-02523-1; Pan-cancer predictive survival model development and evaluation using electronic health record and genetic data across 10 cancer types; Jurgita Gammall (Institute of Health Informatics, University College London); Alvina G. Lai (Institute of Health Informatics, University College London); https://doi.org/10.1007/s12672-025-02523-1; Cancer; Prognosis; Survival; Predictive model; Machine learning; Genetics
title Pan-cancer predictive survival model development and evaluation using electronic health record and genetic data across 10 cancer types
topic Cancer
Prognosis
Survival
Predictive model
Machine learning
Genetics
url https://doi.org/10.1007/s12672-025-02523-1