A natural language processing approach to support biomedical data harmonization: Leveraging large language models.


Bibliographic Details
Main Authors: Zexu Li, Suraj P Prabhu, Zachary T Popp, Shubhi S Jain, Vijetha Balakundi, Ting Fang Alvin Ang, Rhoda Au, Jinying Chen
Format: Article
Language: English
Published: Public Library of Science (PLoS), 2025-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0328262
author Zexu Li
Suraj P Prabhu
Zachary T Popp
Shubhi S Jain
Vijetha Balakundi
Ting Fang Alvin Ang
Rhoda Au
Jinying Chen
author_sort Zexu Li
collection DOAJ
description <h4>Background</h4>Biomedical research requires large, diverse samples to produce unbiased results. Retrospective data harmonization is often used to integrate existing datasets to create these samples, but the process is labor-intensive. Automated methods for matching variables across datasets can accelerate this process, particularly when harmonizing datasets with numerous variables and varied naming conventions. Research in this area has been limited, primarily focusing on lexical matching and ontology-based semantic matching. We aimed to develop new methods, leveraging large language models (LLMs) and ensemble learning, to automate variable matching.<h4>Methods</h4>This study utilized data from two GERAS cohort studies (European [EU] and Japan [JP]) obtained through the Alzheimer's Disease (AD) Data Initiative's AD Workbench. We first manually created a dataset by matching 347 EU variables with 1322 candidate JP variables, treating matched variable pairs as positive instances and unmatched pairs as negative instances. We then developed four natural language processing (NLP) methods using state-of-the-art LLMs (E5, MPNet, MiniLM, and BioLORD-2023) to estimate variable similarity based on variable labels and derivation rules. A lexical matching method using fuzzy matching was included as a baseline. In addition, we developed an ensemble-learning method, using a Random Forest (RF) model, to integrate the individual NLP methods. RF was trained and evaluated over 50 trials, each with a random 4:1 split into training and test sets and with hyperparameters optimized through cross-validation on the training set. For each EU variable, the 1322 candidate JP variables were ranked by NLP-derived similarity scores or RF probability scores, denoting their likelihood of matching the EU variable. Ranking performance was measured by top-n hit ratio (HR-n) and mean reciprocal rank (MRR).<h4>Results</h4>E5 performed best among the individual methods, achieving 0.898 HR-30 and 0.700 MRR. RF outperformed E5 on all metrics over the 50 trials (P < 0.001), achieving an average HR-30 of 0.986 and MRR of 0.744. LLM-derived features contributed most to RF's performance. A major source of errors in automatic variable matching was ambiguous variable definitions.<h4>Conclusion</h4>NLP techniques (especially LLMs), combined with ensemble learning, hold great potential for automating variable matching and accelerating biomedical data harmonization.
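The rank-then-evaluate setup described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' code: a stdlib lexical similarity (`difflib.SequenceMatcher`) stands in for the fuzzy-matching baseline (the LLM methods would instead score candidates by cosine similarity between embeddings of variable labels), and all variable labels are invented examples.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Lexical similarity in [0, 1]; a stand-in for the fuzzy-matching baseline.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def rank_candidates(query: str, candidates: list[str]) -> list[str]:
    # Rank every candidate variable by descending similarity to the query label.
    return sorted(candidates, key=lambda c: similarity(query, c), reverse=True)

def hit_ratio_at_n(rankings: dict, gold: dict, n: int) -> float:
    # HR-n: fraction of queries whose true match appears within the top n.
    hits = sum(1 for q, ranked in rankings.items() if gold[q] in ranked[:n])
    return hits / len(rankings)

def mean_reciprocal_rank(rankings: dict, gold: dict) -> float:
    # MRR: mean of 1 / rank of the true match (ranks are 1-indexed).
    return sum(1.0 / (ranked.index(gold[q]) + 1)
               for q, ranked in rankings.items()) / len(rankings)

# Invented toy labels; the study ranked 1322 JP candidates per EU variable.
eu_vars = ["age at baseline", "total medical cost"]
jp_candidates = ["patient age at baseline visit",
                 "annual medical cost total",
                 "caregiver hours per week"]
gold = {"age at baseline": "patient age at baseline visit",
        "total medical cost": "annual medical cost total"}

rankings = {q: rank_candidates(q, jp_candidates) for q in eu_vars}
print(hit_ratio_at_n(rankings, gold, 1), mean_reciprocal_rank(rankings, gold))
```

With one gold match per query, MRR is simply the average of 1/rank of that match, and HR-1 reduces to top-1 accuracy; HR-30 in the paper asks whether the true match survives into a 30-item shortlist for human review.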
format Article
id doaj-art-17bc622a2f714ac2a1a555e4f10e4e3b
institution DOAJ
issn 1932-6203
language English
publishDate 2025-01-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS ONE
spelling PLoS ONE 20(7): e0328262 (2025-01-01). https://doi.org/10.1371/journal.pone.0328262
title A natural language processing approach to support biomedical data harmonization: Leveraging large language models.
url https://doi.org/10.1371/journal.pone.0328262