Can synthetic data be a proxy for real clinical trial data? A validation study
Objectives There are increasing requirements to make research data, especially clinical trial data, more broadly available for secondary analyses. However, data availability remains a challenge due to complex privacy requirements. This challenge can potentially be addressed using synthetic data.Sett...
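| Main Authors: | Monica Parry, Karolina Kublickiene, Valeria Raparelli, Peter Klimek, Alexandra Kautzky-Willer, Louise Pilote, Michal Abrahamowicz, Colleen M. Norris, Maria Trinidad Herrero, Khaled El Emam, Ruth Sapir-Pichhadze, Simon Bacon, Jennifer Fishman, Zahra Azizi, Chaoyi Zheng, Lucy Mosquera, Karin Humphries |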
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMJ Publishing Group, 2021-04-01 |
| Series: | BMJ Open |
| Online Access: | https://bmjopen.bmj.com/content/11/4/e043497.full |
| _version_ | 1846160789480669184 |
|---|---|
| author | Monica Parry, Karolina Kublickiene, Valeria Raparelli, Peter Klimek, Alexandra Kautzky-Willer, Louise Pilote, Michal Abrahamowicz, Colleen M. Norris, Maria Trinidad Herrero, Khaled El Emam, Ruth Sapir-Pichhadze, Simon Bacon, Jennifer Fishman, Zahra Azizi, Chaoyi Zheng, Lucy Mosquera, Karin Humphries, Khaled El Emam |
| author_sort | Monica Parry |
| collection | DOAJ |
| description | Objectives: There are increasing requirements to make research data, especially clinical trial data, more broadly available for secondary analyses. However, data availability remains a challenge due to complex privacy requirements. This challenge can potentially be addressed using synthetic data. Setting: Replication of a published stage III colon cancer trial secondary analysis using synthetic data generated by a machine learning method. Participants: There were 1543 patients in the control arm who were included in our analysis. Primary and secondary outcome measures: Analyses from a study published on the real dataset were replicated on synthetic data to investigate the relationship between bowel obstruction and event-free survival. Information theoretic metrics were used to compare the univariate distributions between real and synthetic data. Percentage CI overlap was used to assess the similarity in the size of the bivariate relationships, and similarly for the multivariate Cox models derived from the two datasets. Results: Analysis results were similar between the real and synthetic datasets. The univariate distributions differed by less than 1% on an information theoretic metric. All of the bivariate relationships had CI overlap on the tau statistic above 50%. The main conclusion from the published study, that lack of bowel obstruction has a strong impact on survival, was replicated directionally, and the HR CI overlap between the real and synthetic data was 61% for overall survival (real data: HR 1.56, 95% CI 1.11 to 2.2; synthetic data: HR 2.03, 95% CI 1.44 to 2.87) and 86% for disease-free survival (real data: HR 1.51, 95% CI 1.18 to 1.95; synthetic data: HR 1.63, 95% CI 1.26 to 2.1). Conclusions: The high concordance between the analytical results and conclusions from synthetic and real data suggests that synthetic data can be used as a reasonable proxy for real clinical trial datasets. Trial registration number: NCT00079274. |
| format | Article |
| id | doaj-art-a2bba3b623124aa6a9ecebc2b70b28c4 |
| institution | Kabale University |
| issn | 2044-6055 |
| language | English |
| publishDate | 2021-04-01 |
| publisher | BMJ Publishing Group |
| record_format | Article |
| series | BMJ Open |
| spelling | doaj-art-a2bba3b623124aa6a9ecebc2b70b28c4; 2024-11-21T19:35:09Z; eng; BMJ Publishing Group; BMJ Open; ISSN 2044-6055; 2021-04-01; vol. 11, issue 4; doi:10.1136/bmjopen-2020-043497; Can synthetic data be a proxy for real clinical trial data? A validation study; Affiliations: Monica Parry (University of Toronto Lawrence S Bloomberg Faculty of Nursing, Toronto, Ontario, Canada); Karolina Kublickiene (Department of Clinical Science, Intervention and Technology (CLINTEC), Karolinska Institutet, Stockholm, Sweden); Valeria Raparelli (Faculty of Nursing, University of Alberta, Edmonton, Alberta, Canada); Alexandra Kautzky-Willer (Internal Medicine III, Division of Endocrinology and Metabolism, Medical University of Vienna, Wien, Austria); Louise Pilote (Department of Medicine, McGill University, Montreal, Quebec, Canada); Michal Abrahamowicz (Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, Quebec, Canada); Maria Trinidad Herrero (Department of Human Anatomy and Psychobiology, Universidad de Murcia, Murcia, Spain); Khaled El Emam (Electronic Health Information Laboratory, Children’s Hospital of Eastern Ontario Research Institute, Ottawa, Ontario, Canada); Simon Bacon (Department of Health, Kinesiology and Applied Physiology, Concordia University, Montreal, Québec, Canada); Zahra Azizi (Centre for Outcomes Research and Evaluation, McGill University Health Centre, Montreal, Québec, Canada); Chaoyi Zheng (Data Science, Replica Analytics Ltd, Ottawa, Ontario, Canada); Lucy Mosquera (Data Science, Replica Analytics Ltd, Ottawa, Ontario, Canada); https://bmjopen.bmj.com/content/11/4/e043497.full |
| title | Can synthetic data be a proxy for real clinical trial data? A validation study |
| title_sort | can synthetic data be a proxy for real clinical trial data a validation study |
| url | https://bmjopen.bmj.com/content/11/4/e043497.full |
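
The description above reports "percentage CI overlap" between real and synthetic hazard ratios but does not define it. The following is a minimal sketch, assuming the commonly used interval-overlap measure (the average of the fraction of each interval covered by their intersection), computed directly on the HR scale. The function name `ci_overlap` and the clamping to zero for disjoint intervals are illustrative choices, not taken from the record. Applied to the intervals quoted in the description, this assumed formula gives values that round to the reported 61% and 86%.

```python
def ci_overlap(real_ci, synth_ci):
    """Average fraction of each confidence interval covered by their intersection.

    real_ci, synth_ci: (lower, upper) tuples on the same scale.
    Assumption: disjoint intervals are reported as 0.0 rather than a negative value.
    """
    real_lo, real_hi = real_ci
    synth_lo, synth_hi = synth_ci
    if real_hi <= real_lo or synth_hi <= synth_lo:
        raise ValueError("each interval must have lower < upper")

    # Length of the intersection of the two intervals (zero if they do not meet).
    intersection = max(0.0, min(real_hi, synth_hi) - max(real_lo, synth_lo))

    # Fraction of each interval covered by the intersection, then averaged.
    frac_real = intersection / (real_hi - real_lo)
    frac_synth = intersection / (synth_hi - synth_lo)
    return 0.5 * (frac_real + frac_synth)


# 95% CIs for the hazard ratios as quoted in the description.
overall_survival = ci_overlap((1.11, 2.2), (1.44, 2.87))
disease_free_survival = ci_overlap((1.18, 1.95), (1.26, 2.1))

print(f"Overall survival CI overlap: {overall_survival:.0%}")
print(f"Disease-free survival CI overlap: {disease_free_survival:.0%}")
```

Averaging the two fractions keeps the measure symmetric, so a narrow interval sitting entirely inside a much wider one does not score as a perfect match.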