Development of time to event prediction models using federated learning

Bibliographic Details
Main Authors: Rasmus Rask Kragh Jørgensen, Jonas Faartoft Jensen, Tarec El-Galaly, Martin Bøgsted, Rasmus Froberg Brøndum, Mikkel Runason Simonsen, Lasse Hjort Jakobsen
Format: Article
Language: English
Published: BMC 2025-05-01
Series: BMC Medical Research Methodology
Online Access: https://doi.org/10.1186/s12874-025-02598-y
Description
Summary: Abstract Background: In a wide range of diseases, it is necessary to utilize multiple data sources to obtain enough data for model training. However, performing centralized pooling of multiple data sources, while protecting each patient’s sensitive data, can require a cumbersome process involving many institutional bodies. Alternatively, federated learning (FL) can be utilized to train models based on data located at multiple sites. Method: We propose two methods, both relying on FL algorithms, for training time-to-event prediction models based on distributed data. Both approaches incorporate steps to allow prediction of individual-level survival curves without exposing individual-level event times. For Cox proportional hazards models, this is accomplished by using a kernel smoother for the baseline hazard function. The other proposed methodology is based on general parametric likelihood theory for right-censored data. We compared these two methods in four simulation experiments and with one real-world dataset predicting the survival probability in patients with Hodgkin lymphoma (HL). Results: The simulations demonstrated that the FL models performed similarly to the non-distributed case in all four experiments, with only slight deviations in predicted survival probabilities compared to the true model. Our findings were similar in the real-world advanced-stage HL example, where the FL models were compared to their non-distributed versions, revealing only small deviations in performance. Conclusion: The proposed procedures enable training of time-to-event models using data distributed across sites, without direct sharing of individual-level data and event times, while retaining predictive performance on par with undistributed approaches.
ISSN: 1471-2288
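
Illustration (not part of the catalog record or the article): the summary's parametric approach rests on the fact that a right-censored log-likelihood is a sum over patients, so its gradient can be assembled from per-site aggregates. The sketch below is a toy example only, assuming an exponential survival model, a Newton update, and made-up site data; the names `SITES`, `local_score`, `fit_federated`, and `survival` are hypothetical and do not come from the authors' implementation.

```python
import math

# Hypothetical toy data: each site holds (time, event) pairs, where
# event=1 is an observed event and event=0 is a right-censored record.
SITES = {
    "site_a": [(2.0, 1), (3.5, 0), (1.2, 1), (4.0, 1)],
    "site_b": [(0.8, 1), (5.1, 1), (2.7, 0)],
    "site_c": [(3.3, 0), (1.9, 1), (2.2, 1), (0.6, 1), (4.4, 0)],
}


def local_score(records, theta):
    """Run at a site: gradient and Hessian of the local log-likelihood.

    Assumed exponential model with log-rate theta:
        l(theta) = sum_i (d_i * theta - exp(theta) * t_i)
    Only these two aggregate numbers are returned, not individual times.
    """
    n_events = sum(d for _, d in records)
    exposure = sum(t for t, _ in records)
    rate = math.exp(theta)
    return n_events - rate * exposure, -rate * exposure


def fit_federated(sites, n_iter=50, tol=1e-12):
    """Run at the coordinator: Newton steps on the summed site contributions."""
    theta = 0.0
    for _ in range(n_iter):
        parts = [local_score(records, theta) for records in sites.values()]
        grad = sum(g for g, _ in parts)
        hess = sum(h for _, h in parts)
        step = grad / hess
        theta -= step
        if abs(step) < tol:
            break
    return math.exp(theta)


def survival(rate, t):
    """Predicted survival probability S(t) = exp(-rate * t)."""
    return math.exp(-rate * t)


rate_fed = fit_federated(SITES)

# Reference fit on pooled data (events / total follow-up time), for comparison.
pooled = [r for records in SITES.values() for r in records]
rate_pooled = sum(d for _, d in pooled) / sum(t for t, _ in pooled)

print(f"federated rate: {rate_fed:.6f}, pooled rate: {rate_pooled:.6f}")
print(f"S(5) from federated fit: {survival(rate_fed, 5.0):.4f}")
```

In this toy setting the federated and pooled estimates coincide up to numerical tolerance, which mirrors the summary's claim of performance on par with undistributed approaches. Sharing aggregates is not by itself a formal privacy guarantee, and the article's Cox-model procedure with a kernel-smoothed baseline hazard is not reproduced here.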