Benchmarking large language models for biomedical natural language processing applications and recommendations
Abstract: The rapid growth of biomedical literature poses challenges for manual knowledge curation and synthesis. Biomedical Natural Language Processing (BioNLP) automates this process. While Large Language Models (LLMs) have shown promise in general domains, their effectiveness on BioNLP tasks remains unclear due to limited benchmarks and practical guidelines. We perform a systematic evaluation of four LLMs (GPT and LLaMA representatives) on 12 BioNLP benchmarks across six applications. We compare their zero-shot, few-shot, and fine-tuning performance with traditional fine-tuning of BERT or BART models. We examine inconsistencies, missing information, and hallucinations, and perform a cost analysis. Here, we show that traditional fine-tuning outperforms zero- or few-shot LLMs on most tasks. However, closed-source LLMs like GPT-4 excel in reasoning-related tasks such as medical question answering. Open-source LLMs still require fine-tuning to close performance gaps. We find issues such as missing information and hallucinations in LLM outputs. These results offer practical insights for applying LLMs in BioNLP.
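The abstract contrasts zero-shot and few-shot prompting with traditional fine-tuning. For readers unfamiliar with the distinction, below is a minimal sketch of the two prompting styles on a biomedical named-entity extraction example; the prompt wording, example sentences, and GPT-4 client call are illustrative assumptions, not the paper's actual evaluation protocol.

```python
# Minimal sketch of zero-shot vs. few-shot prompting for a BioNLP task
# (disease mention extraction). Prompt wording, example sentences, and the
# GPT-4 model choice are illustrative assumptions, not the study's setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TASK = ("Extract all disease mentions from the sentence. "
        "Return them as a comma-separated list.")

# Few-shot prompts prepend worked examples; zero-shot prompts give only
# the instruction.
FEW_SHOT_EXAMPLES = [
    ("Patients with type 2 diabetes often develop retinopathy.",
     "type 2 diabetes, retinopathy"),
]

def build_prompt(sentence: str, shots: list[tuple[str, str]]) -> str:
    parts = [TASK]
    for text, answer in shots:
        parts.append(f"Sentence: {text}\nDiseases: {answer}")
    parts.append(f"Sentence: {sentence}\nDiseases:")
    return "\n\n".join(parts)

def extract_diseases(sentence: str, shots=()) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic decoding for benchmarking
        messages=[{"role": "user",
                   "content": build_prompt(sentence, list(shots))}],
    )
    return response.choices[0].message.content.strip()

sentence = "BRCA1 mutations increase the risk of breast cancer."
print("zero-shot:", extract_diseases(sentence))
print("few-shot: ", extract_diseases(sentence, FEW_SHOT_EXAMPLES))
```

In the zero-shot call the model sees only the instruction; in the few-shot call it also sees a worked sentence-answer pair, which is what lets in-context examples steer the output without any parameter updates, in contrast to the fine-tuned BERT/BART baselines.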
| Main Authors: | Qingyu Chen, Yan Hu, Xueqing Peng, Qianqian Xie, Qiao Jin, Aidan Gilson, Maxwell B. Singer, Xuguang Ai, Po-Ting Lai, Zhizheng Wang, Vipina K. Keloth, Kalpana Raja, Jimin Huang, Huan He, Fongci Lin, Jingcheng Du, Rui Zhang, W. Jim Zheng, Ron A. Adelman, Zhiyong Lu, Hua Xu |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-04-01 |
| Series: | Nature Communications |
| ISSN: | 2041-1723 |
| Online Access: | https://doi.org/10.1038/s41467-025-56989-2 |
Author affiliations:

| Author | Affiliation |
|---|---|
| Qingyu Chen | Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University |
| Yan Hu | McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston |
| Xueqing Peng | Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University |
| Qianqian Xie | Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University |
| Qiao Jin | National Library of Medicine, National Institutes of Health |
| Aidan Gilson | Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University |
| Maxwell B. Singer | Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University |
| Xuguang Ai | Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University |
| Po-Ting Lai | National Library of Medicine, National Institutes of Health |
| Zhizheng Wang | National Library of Medicine, National Institutes of Health |
| Vipina K. Keloth | Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University |
| Kalpana Raja | Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University |
| Jimin Huang | Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University |
| Huan He | Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University |
| Fongci Lin | Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University |
| Jingcheng Du | McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston |
| Rui Zhang | Division of Computational Health Sciences, Department of Surgery, Medical School, University of Minnesota |
| W. Jim Zheng | McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston |
| Ron A. Adelman | Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University |
| Zhiyong Lu | National Library of Medicine, National Institutes of Health |
| Hua Xu | Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University |