Performance Comparison of Large Language Models for Efficient Literature Screening

<b>Background:</b> Systematic reviewers face a growing body of biomedical literature, making early-stage article screening increasingly time-consuming. In this study, we assessed six large language models (LLMs)—OpenHermes, Flan T5, GPT-2, Claude 3 Haiku, GPT-3.5 Turbo, and GPT-4o—for their ability to identify randomized controlled trials (RCTs) in datasets of increasing difficulty. <b>Methods:</b> We first retrieved articles from PubMed and used all-mpnet-base-v2 to measure semantic similarity to known target RCTs, stratifying the collection into quartiles of descending relevance. Each LLM then received either verbose or concise prompts to classify articles as “Accepted” or “Rejected”. <b>Results:</b> Claude 3 Haiku, GPT-3.5 Turbo, and GPT-4o consistently achieved high recall, though their precision varied in the quartile with the highest similarity, where false positives increased. By contrast, smaller or older models struggled to balance sensitivity and specificity, with some over-including irrelevant studies or missing key articles. Importantly, multi-stage prompts did not guarantee performance gains for weaker models, whereas single-prompt approaches proved effective for advanced LLMs. <b>Conclusions:</b> These findings underscore that both model capability and prompt design strongly affect classification outcomes, suggesting that newer LLMs, if properly guided, can substantially expedite systematic reviews.
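The stratification step described in the abstract — ranking retrieved articles by semantic similarity to known target RCTs and splitting them into quartiles of descending relevance — can be sketched as follows. This is an illustrative outline only, not the authors' code: the article identifiers and similarity scores below are placeholders, and in the study the scores were produced by the all-mpnet-base-v2 sentence-embedding model.

```python
# Illustrative sketch of quartile stratification by similarity score.
# Articles are ranked from most to least similar to the target RCTs,
# then split into four bins of descending relevance.

def stratify_into_quartiles(articles, scores):
    """Return four lists of articles, from most to least similar."""
    # Sort articles by score, highest similarity first.
    ranked = [a for _, a in sorted(zip(scores, articles), reverse=True)]
    n = len(ranked)
    # Quartile boundaries (handles lengths not divisible by four).
    bounds = [round(i * n / 4) for i in range(5)]
    return [ranked[bounds[i]:bounds[i + 1]] for i in range(4)]

# Placeholder data: eight articles with hypothetical similarity scores.
articles = ["A", "B", "C", "D", "E", "F", "G", "H"]
scores = [0.91, 0.35, 0.78, 0.12, 0.66, 0.59, 0.44, 0.23]
q1, q2, q3, q4 = stratify_into_quartiles(articles, scores)
```

Each quartile can then be screened separately, which is what lets the study report how classifier precision changes as the candidate pool becomes more (or less) similar to the targets.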

Bibliographic Details
Main Authors: Maria Teresa Colangelo, Stefano Guizzardi, Marco Meleti, Elena Calciolari, Carlo Galli
Format: Article
Language: English
Published: MDPI AG, 2025-05-01
Series: BioMedInformatics
ISSN: 2673-7426
DOI: 10.3390/biomedinformatics5020025
Subjects: systematic review; large language models; literature screening; artificial intelligence
Online Access: https://www.mdpi.com/2673-7426/5/2/25
Affiliations: Maria Teresa Colangelo, Stefano Guizzardi, and Carlo Galli: Histology and Embryology Laboratory, Department of Medicine and Surgery, University of Parma, Via Volturno 39, 43126 Parma, Italy. Marco Meleti and Elena Calciolari: Department of Medicine and Surgery, Dental School, University of Parma, 43126 Parma, Italy.
DOI: 10.3390/biomedinformatics5020025
Keywords: systematic review; large language models; literature screening; artificial intelligence