Performance Comparison of Large Language Models for Efficient Literature Screening
<b>Background:</b> Systematic reviewers face a growing body of biomedical literature, making early-stage article screening increasingly time-consuming. In this study, we assessed six large language models (LLMs)—OpenHermes, Flan T5, GPT-2, Claude 3 Haiku, GPT-3.5 Turbo, and GPT-4o—for their ability to identify randomized controlled trials (RCTs) in datasets of increasing difficulty. <b>Methods:</b> We first retrieved articles from PubMed and used all-mpnet-base-v2 to measure semantic similarity to known target RCTs, stratifying the collection into quartiles of descending relevance. Each LLM then received either verbose or concise prompts to classify articles as “Accepted” or “Rejected”. <b>Results:</b> Claude 3 Haiku, GPT-3.5 Turbo, and GPT-4o consistently achieved high recall, though their precision varied in the quartile with the highest similarity, where false positives increased. By contrast, smaller or older models struggled to balance sensitivity and specificity, with some over-including irrelevant studies or missing key articles. Importantly, multi-stage prompts did not guarantee performance gains for weaker models, whereas single-prompt approaches proved effective for advanced LLMs. <b>Conclusions:</b> These findings underscore that both model capability and prompt design strongly affect classification outcomes, suggesting that newer LLMs, if properly guided, can substantially expedite systematic reviews.
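The abstract outlines a two-step pipeline: rank PubMed candidates by embedding similarity to known target RCTs (using all-mpnet-base-v2), split the ranking into quartiles of descending relevance, and then prompt each LLM to label articles as “Accepted” or “Rejected”. A minimal sketch of that stratification step, with plain NumPy vectors standing in for the sentence-transformer embeddings (the function names, toy vectors, and prompt wording below are illustrative assumptions, not taken from the paper):

```python
# Illustrative sketch of the screening pipeline from the abstract.
# In the study, embeddings come from all-mpnet-base-v2; here we assume
# precomputed vectors so the stratification logic stays self-contained.
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def stratify_by_similarity(candidate_vecs, target_vecs):
    """Score each candidate by its best similarity to any target RCT,
    then split the descending ranking into four quartiles (Q1 = most similar)."""
    scores = [max(cosine_sim(c, t) for t in target_vecs) for c in candidate_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    q = len(order) // 4 or 1
    return [order[i * q:(i + 1) * q] for i in range(3)] + [order[3 * q:]]

def concise_prompt(title, abstract):
    # A concise single-prompt variant (hypothetical wording, not the
    # exact prompt used in the study).
    return ("Decide whether the study below is a randomized controlled trial.\n"
            "Reply with exactly one word: Accepted or Rejected.\n"
            f"Title: {title}\nAbstract: {abstract}")
```

Each quartile can then be screened separately, which is what lets the study report precision and recall as a function of how semantically close the distractors are to the target RCTs.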
Saved in:
| Main Authors: | Maria Teresa Colangelo, Stefano Guizzardi, Marco Meleti, Elena Calciolari, Carlo Galli |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-05-01 |
| Series: | BioMedInformatics |
| Subjects: | systematic review; large language models; literature screening; artificial intelligence |
| Online Access: | https://www.mdpi.com/2673-7426/5/2/25 |
| _version_ | 1850156795275771904 |
|---|---|
| author | Maria Teresa Colangelo; Stefano Guizzardi; Marco Meleti; Elena Calciolari; Carlo Galli |
| author_sort | Maria Teresa Colangelo |
| collection | DOAJ |
| description | <b>Background:</b> Systematic reviewers face a growing body of biomedical literature, making early-stage article screening increasingly time-consuming. In this study, we assessed six large language models (LLMs)—OpenHermes, Flan T5, GPT-2, Claude 3 Haiku, GPT-3.5 Turbo, and GPT-4o—for their ability to identify randomized controlled trials (RCTs) in datasets of increasing difficulty. <b>Methods:</b> We first retrieved articles from PubMed and used all-mpnet-base-v2 to measure semantic similarity to known target RCTs, stratifying the collection into quartiles of descending relevance. Each LLM then received either verbose or concise prompts to classify articles as “Accepted” or “Rejected”. <b>Results:</b> Claude 3 Haiku, GPT-3.5 Turbo, and GPT-4o consistently achieved high recall, though their precision varied in the quartile with the highest similarity, where false positives increased. By contrast, smaller or older models struggled to balance sensitivity and specificity, with some over-including irrelevant studies or missing key articles. Importantly, multi-stage prompts did not guarantee performance gains for weaker models, whereas single-prompt approaches proved effective for advanced LLMs. <b>Conclusions:</b> These findings underscore that both model capability and prompt design strongly affect classification outcomes, suggesting that newer LLMs, if properly guided, can substantially expedite systematic reviews. |
| format | Article |
| id | doaj-art-ac8ffd254b1a4466a26d175f50ede7a7 |
| institution | OA Journals |
| issn | 2673-7426 |
| language | English |
| publishDate | 2025-05-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | BioMedInformatics |
| spelling | doaj-art-ac8ffd254b1a4466a26d175f50ede7a7; 2025-08-20T02:24:23Z; eng; MDPI AG; BioMedInformatics; 2673-7426; 2025-05-01; 5(2): 25; 10.3390/biomedinformatics5020025; Performance Comparison of Large Language Models for Efficient Literature Screening; Maria Teresa Colangelo, Stefano Guizzardi, Carlo Galli (Histology and Embryology Laboratory, Department of Medicine and Surgery, University of Parma, Via Volturno 39, 43126 Parma, Italy); Marco Meleti, Elena Calciolari (Department of Medicine and Surgery, Dental School, University of Parma, 43126 Parma, Italy); https://www.mdpi.com/2673-7426/5/2/25; systematic review; large language models; literature screening; artificial intelligence |
| title | Performance Comparison of Large Language Models for Efficient Literature Screening |
| title_full | Performance Comparison of Large Language Models for Efficient Literature Screening |
| title_fullStr | Performance Comparison of Large Language Models for Efficient Literature Screening |
| title_full_unstemmed | Performance Comparison of Large Language Models for Efficient Literature Screening |
| title_short | Performance Comparison of Large Language Models for Efficient Literature Screening |
| title_sort | performance comparison of large language models for efficient literature screening |
| topic | systematic review; large language models; literature screening; artificial intelligence |
| url | https://www.mdpi.com/2673-7426/5/2/25 |
| work_keys_str_mv | AT mariateresacolangelo performancecomparisonoflargelanguagemodelsforefficientliteraturescreening AT stefanoguizzardi performancecomparisonoflargelanguagemodelsforefficientliteraturescreening AT marcomeleti performancecomparisonoflargelanguagemodelsforefficientliteraturescreening AT elenacalciolari performancecomparisonoflargelanguagemodelsforefficientliteraturescreening AT carlogalli performancecomparisonoflargelanguagemodelsforefficientliteraturescreening |