Synthetic data for privacy-preserving clinical risk prediction

Abstract Synthetic data promise privacy-preserving data sharing for healthcare research and development. Compared with other privacy-enhancing approaches—such as federated learning—analyses performed on synthetic data can be applied downstream without modification, such that synthetic data can act i...

Bibliographic Details
Main Authors: Zhaozhi Qian, Thomas Callender, Bogdan Cebere, Sam M. Janes, Neal Navani, Mihaela van der Schaar
Format: Article
Language: English
Published: Nature Portfolio 2024-10-01
Series: Scientific Reports
Subjects: Synthetic data, Machine learning, Risk-prediction
Online Access: https://doi.org/10.1038/s41598-024-72894-y
collection DOAJ
description Abstract Synthetic data promise privacy-preserving data sharing for healthcare research and development. Compared with other privacy-enhancing approaches—such as federated learning—analyses performed on synthetic data can be applied downstream without modification, such that synthetic data can act in place of real data for a wide range of use cases. However, the role that synthetic data might play in all aspects of clinical model development remains unknown. In this work, we used state-of-the-art generators explicitly designed for privacy preservation to create a synthetic version of ever-smokers in the UK Biobank before building prognostic models for lung cancer under several data release assumptions. We demonstrate that synthetic data can be effectively used throughout the medical prognostic modeling pipeline even without eventual access to the real data. Furthermore, we show the implications of different data release approaches on how synthetic biobank data could be deployed within the healthcare system.
format Article
id doaj-art-7630251e41d64bfea3d915e53a907c70
institution Kabale University
issn 2045-2322
language English
publishDate 2024-10-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-7630251e41d64bfea3d915e53a907c70 | 2025-08-20T04:02:51Z | eng | Nature Portfolio | Scientific Reports | 2045-2322 | 2024-10-01 | 14 | 1 | 1 | 14 | 10.1038/s41598-024-72894-y | Synthetic data for privacy-preserving clinical risk prediction | Zhaozhi Qian (University of Cambridge); Thomas Callender (University College London); Bogdan Cebere (University of Cambridge); Sam M. Janes (University College London); Neal Navani (University College London); Mihaela van der Schaar (University of Cambridge) | https://doi.org/10.1038/s41598-024-72894-y | Synthetic data; Machine learning; Risk-prediction
title Synthetic data for privacy-preserving clinical risk prediction
topic Synthetic data
Machine learning
Risk-prediction