Performance Comparison of Large Language Models for Efficient Literature Screening

<b>Background:</b> Systematic reviewers face a growing body of biomedical literature, making early-stage article screening increasingly time-consuming. In this study, we assessed six large language models (LLMs)—OpenHermes, Flan T5, GPT-2, Claude 3 Haiku, GPT-3.5 Turbo, and GPT-4o—for their ability to identify randomized controlled trials (RCTs) in datasets of increasing difficulty. <b>Methods:</b> We first retrieved articles from PubMed and used all-mpnet-base-v2 to measure semantic similarity to known target RCTs, stratifying the collection into quartiles of descending relevance. Each LLM then received either verbose or concise prompts to classify articles as “Accepted” or “Rejected”. <b>Results:</b> Claude 3 Haiku, GPT-3.5 Turbo, and GPT-4o consistently achieved high recall, though their precision varied in the quartile with the highest similarity, where false positives increased. By contrast, smaller or older models struggled to balance sensitivity and specificity, with some over-including irrelevant studies or missing key articles. Importantly, multi-stage prompts did not guarantee performance gains for weaker models, whereas single-prompt approaches proved effective for advanced LLMs. <b>Conclusions:</b> These findings underscore that both model capability and prompt design strongly affect classification outcomes, suggesting that newer LLMs, if properly guided, can substantially expedite systematic reviews.
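The stratification step described in the abstract — ranking retrieved articles by semantic similarity to known target RCTs and splitting them into quartiles of descending relevance — can be sketched as follows. This is an illustrative outline only, not the authors' code: the article identifiers and similarity scores below are placeholders, and in the study the scores were produced by the all-mpnet-base-v2 sentence-embedding model.

```python
# Illustrative sketch of quartile stratification by similarity score.
# Articles are ranked from most to least similar to the target RCTs,
# then split into four bins of descending relevance.

def stratify_into_quartiles(articles, scores):
    """Return four lists of articles, from most to least similar."""
    # Sort articles by score, highest similarity first.
    ranked = [a for _, a in sorted(zip(scores, articles), reverse=True)]
    n = len(ranked)
    # Quartile boundaries (handles lengths not divisible by four).
    bounds = [round(i * n / 4) for i in range(5)]
    return [ranked[bounds[i]:bounds[i + 1]] for i in range(4)]

# Placeholder data: eight articles with hypothetical similarity scores.
articles = ["A", "B", "C", "D", "E", "F", "G", "H"]
scores = [0.91, 0.35, 0.78, 0.12, 0.66, 0.59, 0.44, 0.23]
q1, q2, q3, q4 = stratify_into_quartiles(articles, scores)
```

Each quartile can then be screened separately, which is what lets the study report how classifier precision changes as the candidate pool becomes more (or less) similar to the targets.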

Bibliographic Details
Main Authors: Maria Teresa Colangelo, Stefano Guizzardi, Marco Meleti, Elena Calciolari, Carlo Galli
Format: Article
Language: English
Published: MDPI AG, 2025-05-01
Series: BioMedInformatics
ISSN: 2673-7426
DOI: 10.3390/biomedinformatics5020025
Subjects: systematic review; large language models; literature screening; artificial intelligence
Online Access: https://www.mdpi.com/2673-7426/5/2/25
Affiliations: Maria Teresa Colangelo, Stefano Guizzardi, and Carlo Galli: Histology and Embryology Laboratory, Department of Medicine and Surgery, University of Parma, Via Volturno 39, 43126 Parma, Italy. Marco Meleti and Elena Calciolari: Department of Medicine and Surgery, Dental School, University of Parma, 43126 Parma, Italy.
DOI: 10.3390/biomedinformatics5020025
Keywords: systematic review; large language models; literature screening; artificial intelligence