Can synthetic data be a proxy for real clinical trial data? A validation study

Objectives There are increasing requirements to make research data, especially clinical trial data, more broadly available for secondary analyses. However, data availability remains a challenge due to complex privacy requirements. This challenge can potentially be addressed using synthetic data.Sett...

Bibliographic Details
Main Authors: Monica Parry, Karolina Kublickiene, Valeria Raparelli, Peter Klimek, Alexandra Kautzky-Willer, Louise Pilote, Michal Abrahamowicz, Colleen M. Norris, Maria Trinidad Herrero, Khaled El Emam, Ruth Sapir-Pichhadze, Simon Bacon, Jennifer Fishman, Zahra Azizi, Chaoyi Zheng, Lucy Mosquera, Karin Humphries
Format: Article
Language: English
Published: BMJ Publishing Group 2021-04-01
Series: BMJ Open
Online Access: https://bmjopen.bmj.com/content/11/4/e043497.full
collection DOAJ
description Objectives: There are increasing requirements to make research data, especially clinical trial data, more broadly available for secondary analyses. However, data availability remains a challenge due to complex privacy requirements. This challenge can potentially be addressed using synthetic data.

Setting: Replication of a published stage III colon cancer trial secondary analysis using synthetic data generated by a machine learning method.

Participants: There were 1543 patients in the control arm that were included in our analysis.

Primary and secondary outcome measures: Analyses from a study published on the real dataset were replicated on synthetic data to investigate the relationship between bowel obstruction and event-free survival. Information theoretic metrics were used to compare the univariate distributions between real and synthetic data. Percentage CI overlap was used to assess the similarity in the size of the bivariate relationships, and similarly for the multivariate Cox models derived from the two datasets.

Results: Analysis results were similar between the real and synthetic datasets. The univariate distributions were within 1% of difference on an information theoretic metric. All of the bivariate relationships had CI overlap on the tau statistic above 50%. The main conclusion from the published study, that lack of bowel obstruction has a strong impact on survival, was replicated directionally and the HR CI overlap between the real and synthetic data was 61% for overall survival (real data: HR 1.56, 95% CI 1.11 to 2.2; synthetic data: HR 2.03, 95% CI 1.44 to 2.87) and 86% for disease-free survival (real data: HR 1.51, 95% CI 1.18 to 1.95; synthetic data: HR 1.63, 95% CI 1.26 to 2.1).

Conclusions: The high concordance between the analytical results and conclusions from synthetic and real data suggests that synthetic data can be used as a reasonable proxy for real clinical trial datasets.

Trial registration number: NCT00079274.
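The abstract reports percentage CI overlap but does not give the formula. The sketch below assumes one common definition: the length of the intersection of the two intervals, divided by each interval's width, then averaged. It is computed directly on the HR scale. The function name `ci_overlap` and the choice to return 0 for disjoint intervals are illustrative assumptions, not taken from the record. Under these assumptions, the intervals quoted in the Results round to the reported 61% and 86%.

```python
def ci_overlap(real_ci, synth_ci):
    """Average share of each interval covered by their intersection.

    Assumed definition (not stated in the abstract); intervals are
    (lower, upper) tuples on the hazard ratio scale.
    """
    (real_lo, real_hi), (synth_lo, synth_hi) = real_ci, synth_ci
    intersection = min(real_hi, synth_hi) - max(real_lo, synth_lo)
    if intersection <= 0:
        # Disjoint intervals: report zero overlap (an illustrative choice).
        return 0.0
    real_share = intersection / (real_hi - real_lo)
    synth_share = intersection / (synth_hi - synth_lo)
    return 0.5 * (real_share + synth_share)


# 95% CIs for the hazard ratios quoted in the Results section.
os_overlap = ci_overlap((1.11, 2.2), (1.44, 2.87))    # overall survival
dfs_overlap = ci_overlap((1.18, 1.95), (1.26, 2.1))   # disease-free survival

print(f"overall survival: {os_overlap:.0%}")
print(f"disease-free survival: {dfs_overlap:.0%}")
```

Averaging the two shares keeps the measure symmetric, so a narrow interval sitting inside a much wider one does not score as full agreement.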
id doaj-art-a2bba3b623124aa6a9ecebc2b70b28c4
institution Kabale University
issn 2044-6055
DOI: 10.1136/bmjopen-2020-043497
Volume: 11
Issue: 4
Author affiliations:
University of Toronto Lawrence S Bloomberg Faculty of Nursing, Toronto, Ontario, Canada
Department of Clinical Science, Intervention and Technology (CLINTEC), Karolinska Institutet, Stockholm, Sweden
Faculty of Nursing, University of Alberta, Edmonton, Alberta, Canada
Internal Medicine III, Division of Endocrinology and Metabolism, Medical University of Vienna, Wien, Austria
Department of Medicine, McGill University, Montreal, Quebec, Canada
Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, Quebec, Canada
Department of Human Anatomy and Psychobiology, Universidad de Murcia, Murcia, Spain
Electronic Health Information Laboratory, Children’s Hospital of Eastern Ontario Research Institute, Ottawa, Ontario, Canada
Department of Health, Kinesiology and Applied Physiology, Concordia University, Montreal, Québec, Canada
Centre for Outcomes Research and Evaluation, McGill University Health Centre, Montreal, Québec, Canada
Data Science, Replica Analytics Ltd, Ottawa, Ontario, Canada