Benchmarking large language models for biomedical natural language processing applications and recommendations

Abstract: The rapid growth of biomedical literature poses challenges for manual knowledge curation and synthesis. Biomedical Natural Language Processing (BioNLP) automates this process. While Large Language Models (LLMs) have shown promise in general domains, their effectiveness on BioNLP tasks remains unclear due to limited benchmarks and practical guidelines. We perform a systematic evaluation of four LLMs (GPT and LLaMA representatives) on 12 BioNLP benchmarks across six applications. We compare their zero-shot, few-shot, and fine-tuning performance with the traditional fine-tuning of BERT or BART models. We examine inconsistencies, missing information, and hallucinations, and perform a cost analysis. Here, we show that traditional fine-tuning outperforms zero- or few-shot LLMs on most tasks. However, closed-source LLMs such as GPT-4 excel at reasoning-related tasks like medical question answering, while open-source LLMs still require fine-tuning to close the performance gap. We also find issues such as missing information and hallucinations in LLM outputs. These results offer practical insights for applying LLMs in BioNLP.

Bibliographic Details
Main Authors: Qingyu Chen, Yan Hu, Xueqing Peng, Qianqian Xie, Qiao Jin, Aidan Gilson, Maxwell B. Singer, Xuguang Ai, Po-Ting Lai, Zhizheng Wang, Vipina K. Keloth, Kalpana Raja, Jimin Huang, Huan He, Fongci Lin, Jingcheng Du, Rui Zhang, W. Jim Zheng, Ron A. Adelman, Zhiyong Lu, Hua Xu
Format: Article
Language: English
Published: Nature Portfolio, 2025-04-01
Series: Nature Communications (Vol. 16, Iss. 1)
ISSN: 2041-1723
Online Access: https://doi.org/10.1038/s41467-025-56989-2

Author Affiliations:
Qingyu Chen, Xueqing Peng, Qianqian Xie, Xuguang Ai, Vipina K. Keloth, Kalpana Raja, Jimin Huang, Huan He, Fongci Lin, Hua Xu: Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University
Yan Hu, Jingcheng Du, W. Jim Zheng: McWilliams School of Biomedical Informatics, University of Texas Health Science at Houston
Qiao Jin, Po-Ting Lai, Zhizheng Wang, Zhiyong Lu: National Library of Medicine, National Institutes of Health
Aidan Gilson, Maxwell B. Singer, Ron A. Adelman: Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University
Rui Zhang: Division of Computational Health Sciences, Department of Surgery, Medical School, University of Minnesota