Performance Comparison of Large Language Models for Efficient Literature Screening
| Main Authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-05-01 |
| Series: | BioMedInformatics |
| Subjects: | |
| Online Access: | https://www.mdpi.com/2673-7426/5/2/25 |
| Summary: | <b>Background:</b> Systematic reviewers face a growing body of biomedical literature, making early-stage article screening increasingly time-consuming. In this study, we assessed six large language models (LLMs)—OpenHermes, Flan T5, GPT-2, Claude 3 Haiku, GPT-3.5 Turbo, and GPT-4o—for their ability to identify randomized controlled trials (RCTs) in datasets of increasing difficulty. <b>Methods:</b> We first retrieved articles from PubMed and used all-mpnet-base-v2 to measure semantic similarity to known target RCTs, stratifying the collection into quartiles of descending relevance. Each LLM then received either verbose or concise prompts to classify articles as “Accepted” or “Rejected”. <b>Results:</b> Claude 3 Haiku, GPT-3.5 Turbo, and GPT-4o consistently achieved high recall, though their precision varied in the quartile with the highest similarity, where false positives increased. By contrast, smaller or older models struggled to balance sensitivity and specificity, with some over-including irrelevant studies or missing key articles. Importantly, multi-stage prompts did not guarantee performance gains for weaker models, whereas single-prompt approaches proved effective for advanced LLMs. <b>Conclusions:</b> These findings underscore that both model capability and prompt design strongly affect classification outcomes, suggesting that newer LLMs, if properly guided, can substantially expedite systematic reviews. |
| ISSN: | 2673-7426 |
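The summary describes a stratification step: candidate articles are embedded, ranked by semantic similarity to known target RCTs, and split into quartiles of descending relevance before being passed to the LLMs. A minimal sketch of that step is below; it is illustrative only, using toy two-dimensional vectors in place of the all-mpnet-base-v2 sentence embeddings the study actually uses, and `stratify_by_similarity` is a hypothetical helper name, not code from the paper.

```python
# Illustrative sketch of the quartile-stratification step from the summary.
# Assumption: in the study, embeddings come from the all-mpnet-base-v2
# sentence encoder; toy vectors stand in here so the example is self-contained.
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def stratify_by_similarity(candidates, target):
    """Rank candidate embeddings by similarity to a target embedding and
    split them into four quartiles of descending relevance (Q1 = most similar)."""
    ranked = sorted(candidates, key=lambda v: cosine(v, target), reverse=True)
    size = -(-len(ranked) // 4)  # ceiling division so all items are assigned
    return [ranked[i * size:(i + 1) * size] for i in range(4)]

# Toy example: 8 two-dimensional "embeddings" compared against one target.
target = [1.0, 0.0]
candidates = [[1.0, 0.1 * i] for i in range(8)]
quartiles = stratify_by_similarity(candidates, target)
# quartiles[0] holds the vectors most aligned with the target direction.
```

In the study's setting, the same pattern would apply with 768-dimensional sentence embeddings and a pool of PubMed abstracts; each quartile then forms a screening dataset of increasing difficulty for the LLM classifiers.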