Social determinants of health extraction from clinical notes across institutions using large language models
Abstract: Detailed information on social determinants of health (SDoH) is often buried within clinical text in electronic health records (EHRs). Most current NLP efforts for SDoH have limitations: they investigate a limited set of factors, derive data from a single institution, and use specific patient cohorts or note types, with reduced focus on generaliza...
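| Main Authors: | Vipina K. Keloth, Salih Selek, Qingyu Chen, Christopher Gilman, Sunyang Fu, Yifang Dang, Xinghan Chen, Xinyue Hu, Yujia Zhou, Huan He, Jungwei W. Fan, Karen Wang, Cynthia Brandt, Cui Tao, Hongfang Liu, Hua Xu |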
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-05-01 |
| Series: | npj Digital Medicine |
| Online Access: | https://doi.org/10.1038/s41746-025-01645-8 |
| _version_ | 1850154790421528576 |
|---|---|
| author | Vipina K. Keloth; Salih Selek; Qingyu Chen; Christopher Gilman; Sunyang Fu; Yifang Dang; Xinghan Chen; Xinyue Hu; Yujia Zhou; Huan He; Jungwei W. Fan; Karen Wang; Cynthia Brandt; Cui Tao; Hongfang Liu; Hua Xu |
| author_sort | Vipina K. Keloth |
| collection | DOAJ |
| description | Detailed information on social determinants of health (SDoH) is often buried within clinical text in electronic health records (EHRs). Most current NLP efforts for SDoH have limitations: they investigate a limited set of factors, derive data from a single institution, and use specific patient cohorts or note types, with reduced focus on generalizability. We aim to address these issues by creating cross-institutional corpora and by developing classification models, including large language models (LLMs), for detecting SDoH factors and evaluating their generalizability using data from four institutions. Clinical notes were annotated with 21 SDoH factors at two levels: level 1 (SDoH factors only) and level 2 (SDoH factors and associated values). Compared with the other models, an instruction-tuned LLM achieved the top performance, with a micro-averaged F1 over 0.9 on the level 1 corpora and over 0.84 on the level 2 corpora. While models performed well when trained and tested on individual datasets, cross-dataset generalization highlighted remaining obstacles. Access to the trained models will be made available at https://github.com/BIDS-Xu-Lab/LLMs4SDoH. |
| format | Article |
| id | doaj-art-11b6b84ce18147ce834aa7fe4348a042 |
| institution | OA Journals |
| issn | 2398-6352 |
| language | English |
| publishDate | 2025-05-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | npj Digital Medicine |
| spelling | doaj-art-11b6b84ce18147ce834aa7fe4348a042; 2025-08-20T02:25:12Z; eng; Nature Portfolio; npj Digital Medicine; 2398-6352; 2025-05-01; 8; 1; 1; 13; 10.1038/s41746-025-01645-8; Social determinants of health extraction from clinical notes across institutions using large language models; Vipina K. Keloth (Department of Biomedical Informatics and Data Science, Yale School of Medicine); Salih Selek (Department of Psychiatry and Behavioral Sciences, UTHealth McGovern Medical School); Qingyu Chen (Department of Biomedical Informatics and Data Science, Yale School of Medicine); Christopher Gilman (Department of Biomedical Informatics and Data Science, Yale School of Medicine); Sunyang Fu (McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston); Yifang Dang (McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston); Xinghan Chen (School of Public Health, University of Texas Health Science Center at Houston); Xinyue Hu (Department of Artificial Intelligence and Informatics, Mayo Clinic); Yujia Zhou (Department of Biomedical Informatics and Data Science, Yale School of Medicine); Huan He (Department of Biomedical Informatics and Data Science, Yale School of Medicine); Jungwei W. Fan (Department of Artificial Intelligence and Informatics, Mayo Clinic); Karen Wang (Department of Biomedical Informatics and Data Science, Yale School of Medicine); Cynthia Brandt (Department of Biomedical Informatics and Data Science, Yale School of Medicine); Cui Tao (Department of Artificial Intelligence and Informatics, Mayo Clinic); Hongfang Liu (McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston); Hua Xu (Department of Biomedical Informatics and Data Science, Yale School of Medicine); https://doi.org/10.1038/s41746-025-01645-8 |
| title | Social determinants of health extraction from clinical notes across institutions using large language models |
| url | https://doi.org/10.1038/s41746-025-01645-8 |