Robust Automated Harmonization of Heterogeneous Data Through Ensemble Machine Learning: Algorithm Development and Validation Study

Abstract Background: Cohort studies contain rich clinical data across large and diverse patient populations and are a common source of observational data for clinical research. Because large scale cohort studies are both time and resource intensive, one alternative is to harmoni...

Bibliographic Details
Main Authors:	Doris Yang, Doudou Zhou, Steven Cai, Ziming Gan, Michael Pencina, Paul Avillach, Tianxi Cai, Chuan Hong
Format:	Article
Language:	English
Published:	JMIR Publications 2025-01-01
Series:	JMIR Medical Informatics
Online Access:	https://medinform.jmir.org/2025/1/e54133

collection	DOAJ
description	Abstract Background: Cohort studies contain rich clinical data across large and diverse patient populations and are a common source of observational data for clinical research. Because large scale cohort studies are both time and resource intensive, one alternative is to harmonize data from existing cohorts through multicohort studies. However, given differences in variable encoding, accurate variable harmonization is difficult. Objective: We propose SONAR (Semantic and Distribution-Based Harmonization) as a method for harmonizing variables across cohort studies to facilitate multicohort studies. Methods: SONAR used semantic learning from variable descriptions and distribution learning from study participant data. Our method learned an embedding vector for each variable and used pairwise cosine similarity to score the similarity between variables. This approach was built off 3 National Institutes of Health cohorts, including the Cardiovascular Health Study, the Multi-Ethnic Study of Atherosclerosis, and the Women’s Health Initiative. We also used gold standard labels to further refine the embeddings in a supervised manner. Results: The method was evaluated using manually curated gold standard labels from the 3 National Institutes of Health cohorts. We evaluated both the intracohort and intercohort variable harmonization performance. The supervised SONAR method outperformed existing benchmark methods for almost all intracohort and intercohort comparisons using area under the curve and top-k. Conclusions: SONAR achieves accurate variable harmonization within and between cohort studies by harnessing the complementary strengths of semantic learning and variable distribution learning.
id	doaj-art-1fd7a624d89d49959ea52737ee91d565
institution	DOAJ
issn	2291-9694
spelling	doaj-art-1fd7a624d89d49959ea52737ee91d565; 2025-08-20T02:45:45Z; eng; JMIR Publications; JMIR Medical Informatics; ISSN 2291-9694; 2025-01-01; volume 13; e54133; DOI 10.2196/54133; Robust Automated Harmonization of Heterogeneous Data Through Ensemble Machine Learning: Algorithm Development and Validation Study; Doris Yang (http://orcid.org/0000-0002-5188-2571); Doudou Zhou (http://orcid.org/0000-0002-0830-2287); Steven Cai (http://orcid.org/0009-0008-2753-9176); Ziming Gan (http://orcid.org/0009-0009-1661-3753); Michael Pencina (http://orcid.org/0000-0002-1968-2641); Paul Avillach (http://orcid.org/0000-0002-0235-7543); Tianxi Cai (http://orcid.org/0000-0002-5379-2502); Chuan Hong (http://orcid.org/0000-0001-7056-9559); https://medinform.jmir.org/2025/1/e54133

Similar Items