Validation of automated paper screening for esophagectomy systematic review using large language models

Background Large language models (LLMs) offer a potential solution to the labor-intensive nature of systematic reviews. This study evaluated the ability of the GPT model to identify articles that discuss perioperative risk factors for esophagectomy complications. To test the performance of the model...

Bibliographic Details
Main Authors: Rashi Ramchandani, Eddie Guo, Esra Rakab, Jharna Rathod, Jamie Strain, William Klement, Risa Shorr, Erin Williams, Daniel Jones, Sebastien Gilbert
Format: Article
Language: English
Published: PeerJ Inc. 2025-04-01
Series: PeerJ Computer Science
Subjects: Systematic review; Abstract screening; ChatGPT; Large language model; Screening
Online Access: https://peerj.com/articles/cs-2822.pdf
_version_ 1850194924265275392
author Rashi Ramchandani
Eddie Guo
Esra Rakab
Jharna Rathod
Jamie Strain
William Klement
Risa Shorr
Erin Williams
Daniel Jones
Sebastien Gilbert
author_sort Rashi Ramchandani
collection DOAJ
description Background: Large language models (LLMs) offer a potential solution to the labor-intensive nature of systematic reviews. This study evaluated the ability of the GPT model to identify articles that discuss perioperative risk factors for esophagectomy complications. To further test the model, we evaluated GPT-4 on a narrower inclusion criterion, assessing its ability to discriminate relevant articles that identified solely preoperative risk factors for esophagectomy. Methods: A literature search was run by a trained librarian to identify studies (n = 1,967) discussing risk factors for esophagectomy complications. The articles underwent title and abstract screening by three independent human reviewers and GPT-4. The Python script used for the analysis made Application Programming Interface (API) calls to GPT-4 with screening criteria in natural language. GPT-4’s inclusion and exclusion decisions were compared with those of the human reviewers. Results: The agreement between the GPT model and human decisions was 85.58% for perioperative factors and 78.75% for preoperative factors. The AUC values were 0.87 and 0.75 for the perioperative and preoperative risk factor queries, respectively. In the evaluation of perioperative risk factors, the GPT model demonstrated a high recall for included studies at 89%, a positive predictive value of 74%, and a negative predictive value of 84%, with a low false positive rate of 6% and a macro-F1 score of 0.81. For preoperative risk factors, the model showed a recall of 67% for included studies, a positive predictive value of 65%, and a negative predictive value of 85%, with a false positive rate of 15% and a macro-F1 score of 0.66. The interobserver reliability was substantial, with a kappa score of 0.69 for perioperative factors and 0.61 for preoperative factors. Despite lower accuracy under more stringent criteria, the GPT model proved valuable in streamlining the systematic review workflow. In a preliminary evaluation, study screeners reported that the inclusion and exclusion justifications provided by the GPT model were useful, especially in resolving discrepancies during title and abstract screening. Conclusion: This study demonstrates a promising use of LLMs to streamline the workflow of systematic reviews. The integration of LLMs in systematic reviews could lead to significant time and cost savings; however, caution is warranted for reviews involving narrower, more stringent inclusion and exclusion criteria. Future research should explore integrating LLMs in other steps of the systematic review, such as full-text screening or data extraction, and compare different LLMs for their effectiveness in various types of systematic reviews.
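The agreement statistics named in the description (percent agreement, recall, positive and negative predictive value, false positive rate, macro-F1, and Cohen's kappa) can all be derived from a 2x2 comparison of human and model decisions. The following is a minimal sketch of that arithmetic, not the authors' analysis script: the function names, the "include"/"exclude" labels, and the choice of "include" as the positive class are assumptions made for illustration. AUC is omitted because it needs a continuous score per article rather than a binary decision.

```python
from collections import Counter


def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return safe_div(2 * precision * recall, precision + recall)


def screening_metrics(human, model):
    """Compare model decisions with human decisions (hypothetical helper).

    Both arguments are equal-length sequences of "include" / "exclude".
    "include" is treated as the positive class (an assumption).
    """
    pairs = Counter(zip(human, model))
    tp = pairs[("include", "include")]
    fp = pairs[("exclude", "include")]
    fn = pairs[("include", "exclude")]
    tn = pairs[("exclude", "exclude")]
    n = tp + fp + fn + tn

    agreement = safe_div(tp + tn, n)
    recall = safe_div(tp, tp + fn)        # sensitivity for included studies
    ppv = safe_div(tp, tp + fp)           # precision for the include class
    npv = safe_div(tn, tn + fn)           # precision for the exclude class
    specificity = safe_div(tn, tn + fp)   # recall for the exclude class
    fpr = safe_div(fp, fp + tn)

    # Macro-F1: unweighted mean of the per-class F1 scores.
    macro_f1 = (f1(ppv, recall) + f1(npv, specificity)) / 2

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = safe_div((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn), n * n)
    kappa = safe_div(agreement - p_chance, 1 - p_chance)

    return {
        "agreement": agreement,
        "recall": recall,
        "ppv": ppv,
        "npv": npv,
        "fpr": fpr,
        "macro_f1": macro_f1,
        "kappa": kappa,
    }
```

Calling `screening_metrics(human_labels, model_labels)` on two equal-length lists of decisions returns a dictionary of the statistics. The zero-denominator guard is a design choice for small or one-sided samples and would not matter on a set of 1,967 articles.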
format Article
id doaj-art-4183f4de6a5b4eef8a14ca6755381eb2
institution OA Journals
issn 2376-5992
language English
publishDate 2025-04-01
publisher PeerJ Inc.
record_format Article
series PeerJ Computer Science
spelling doaj-art-4183f4de6a5b4eef8a14ca6755381eb2
2025-08-20T02:13:53Z
eng
PeerJ Inc.
PeerJ Computer Science
2376-5992
2025-04-01
11
e2822
10.7717/peerj-cs.2822
Validation of automated paper screening for esophagectomy systematic review using large language models
Rashi Ramchandani, Department of Medicine, University of Ottawa, Ottawa, Ontario, Canada
Eddie Guo, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada
Esra Rakab, Department of Medicine, University of Ottawa, Ottawa, Ontario, Canada
Jharna Rathod, Department of Medicine, University of Ottawa, Ottawa, Ontario, Canada
Jamie Strain, Ottawa Hospital Research Institute, Ottawa, Ontario, Canada
William Klement, Ottawa Hospital Research Institute, Ottawa, Ontario, Canada
Risa Shorr, Library and Learning Services, The Ottawa Hospital, Ottawa, Ontario, Canada
Erin Williams, Division of General Surgery, Department of Surgery, The Ottawa Hospital, Ottawa, Ontario, Canada
Daniel Jones, Division of General Surgery, Department of Surgery, The Ottawa Hospital, Ottawa, Ontario, Canada
Sebastien Gilbert, Division of General Surgery, Department of Surgery, The Ottawa Hospital, Ottawa, Ontario, Canada
https://peerj.com/articles/cs-2822.pdf
Systematic review
Abstract screening
ChatGPT
Large language model
Screening
title Validation of automated paper screening for esophagectomy systematic review using large language models
topic Systematic review
Abstract screening
ChatGPT
Large language model
Screening
url https://peerj.com/articles/cs-2822.pdf