Social determinants of health extraction from clinical notes across institutions using large language models

Abstract: Detailed social determinants of health (SDoH) are often buried within clinical text in EHRs. Most current NLP efforts for SDoH have limitations: they investigate limited factors, derive data from a single institution, and use specific patient cohorts/note types, with reduced focus on generaliza...

Bibliographic Details
Main Authors: Vipina K. Keloth, Salih Selek, Qingyu Chen, Christopher Gilman, Sunyang Fu, Yifang Dang, Xinghan Chen, Xinyue Hu, Yujia Zhou, Huan He, Jungwei W. Fan, Karen Wang, Cynthia Brandt, Cui Tao, Hongfang Liu, Hua Xu
Format: Article
Language: English
Published: Nature Portfolio 2025-05-01
Series: npj Digital Medicine
Online Access: https://doi.org/10.1038/s41746-025-01645-8
author Vipina K. Keloth
Salih Selek
Qingyu Chen
Christopher Gilman
Sunyang Fu
Yifang Dang
Xinghan Chen
Xinyue Hu
Yujia Zhou
Huan He
Jungwei W. Fan
Karen Wang
Cynthia Brandt
Cui Tao
Hongfang Liu
Hua Xu
collection DOAJ
description Abstract: Detailed social determinants of health (SDoH) are often buried within clinical text in EHRs. Most current NLP efforts for SDoH have limitations: they investigate limited factors, derive data from a single institution, and use specific patient cohorts/note types, with reduced focus on generalizability. We aim to address these issues by creating cross-institutional corpora and by developing and evaluating the generalizability of classification models, including large language models (LLMs), for detecting SDoH factors using data from four institutions. Clinical notes were annotated with 21 SDoH factors at two levels: level 1 (SDoH factors only) and level 2 (SDoH factors and associated values). Compared to other models, an instruction-tuned LLM achieved the top performance, with micro-averaged F1 over 0.9 on level 1 corpora and over 0.84 on level 2 corpora. While models performed well when trained and tested on individual datasets, cross-dataset generalization highlighted remaining obstacles. Access to trained models will be made available at https://github.com/BIDS-Xu-Lab/LLMs4SDoH.
format Article
id doaj-art-11b6b84ce18147ce834aa7fe4348a042
institution OA Journals
issn 2398-6352
language English
publishDate 2025-05-01
publisher Nature Portfolio
record_format Article
series npj Digital Medicine
spelling doaj-art-11b6b84ce18147ce834aa7fe4348a042; 2025-08-20T02:25:12Z; eng; Nature Portfolio; npj Digital Medicine; 2398-6352; 2025-05-01; vol. 8, no. 1, pp. 1-13; 10.1038/s41746-025-01645-8
Social determinants of health extraction from clinical notes across institutions using large language models
Vipina K. Keloth (Department of Biomedical Informatics and Data Science, Yale School of Medicine)
Salih Selek (Department of Psychiatry and Behavioral Sciences, UTHealth McGovern Medical School)
Qingyu Chen (Department of Biomedical Informatics and Data Science, Yale School of Medicine)
Christopher Gilman (Department of Biomedical Informatics and Data Science, Yale School of Medicine)
Sunyang Fu (McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston)
Yifang Dang (McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston)
Xinghan Chen (School of Public Health, University of Texas Health Science Center at Houston)
Xinyue Hu (Department of Artificial Intelligence and Informatics, Mayo Clinic)
Yujia Zhou (Department of Biomedical Informatics and Data Science, Yale School of Medicine)
Huan He (Department of Biomedical Informatics and Data Science, Yale School of Medicine)
Jungwei W. Fan (Department of Artificial Intelligence and Informatics, Mayo Clinic)
Karen Wang (Department of Biomedical Informatics and Data Science, Yale School of Medicine)
Cynthia Brandt (Department of Biomedical Informatics and Data Science, Yale School of Medicine)
Cui Tao (Department of Artificial Intelligence and Informatics, Mayo Clinic)
Hongfang Liu (McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston)
Hua Xu (Department of Biomedical Informatics and Data Science, Yale School of Medicine)
https://doi.org/10.1038/s41746-025-01645-8
title Social determinants of health extraction from clinical notes across institutions using large language models
url https://doi.org/10.1038/s41746-025-01645-8