Concordance between humans and GPT-4 in appraising the methodological quality of case reports and case series using the Murad tool

Abstract
Background: Assessing the methodological quality of case reports and case series is challenging because of variability in human judgment and time constraints. We evaluated the agreement in judgments between human reviewers and GPT-4 when applying a standard methodological quality assessment tool designed for case reports and series.
Methods: We searched Scopus for systematic reviews published in 2023–2024 that cited the appraisal tool by Murad et al. A GPT-4-based agent was developed to assess methodological quality using the 8 signaling questions of the tool. Observed agreement and an agreement coefficient were estimated by comparing the published judgments of human reviewers with the GPT-4 assessments.
Results: We included 797 case reports and series. Observed agreement ranged from 41.91% to 80.93% across the eight questions (agreement coefficient: 25.39% to 79.72%). The lowest agreement was for the first signaling question, on selection of cases. Agreement was similar for articles published in journals with an impact factor < 5 vs. ≥ 5, and when excluding systematic reviews that did not use the 3 causality questions. Repeating the analysis with the same prompts showed high agreement between the two GPT-4 attempts, except for the first question on selection of cases.
Conclusions: The study demonstrates moderate agreement between GPT-4 and human reviewers in assessing the methodological quality of case reports and series with the Murad tool. The current performance of GPT-4 is promising but unlikely to be sufficient for the rigor of a systematic review; pairing the model with a human reviewer remains necessary.
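
To make the Methods concrete, the following is a minimal illustrative sketch (not the study's code) of how observed agreement and an agreement coefficient can be computed for a single signaling question, assuming each rater (a human reviewer and GPT-4) records a binary yes/no judgment per article. Cohen's kappa is used here only as an example coefficient; this record does not specify which agreement coefficient the authors estimated.

# Illustrative sketch only (not the study's code). Assumes binary yes/no
# judgments per article for a single signaling question of the Murad tool.
# Cohen's kappa stands in for "agreement coefficient"; the record does not
# name the specific coefficient used in the paper.

def observed_agreement(human, model):
    # Proportion of articles on which both raters gave the same judgment.
    return sum(h == m for h, m in zip(human, model)) / len(human)

def cohens_kappa(human, model):
    # Chance-corrected agreement for two raters and two categories (yes/no).
    n = len(human)
    po = observed_agreement(human, model)
    p_yes_human = sum(human) / n
    p_yes_model = sum(model) / n
    pe = p_yes_human * p_yes_model + (1 - p_yes_human) * (1 - p_yes_model)
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)

# Hypothetical judgments (1 = yes, 0 = no) for eight articles:
human_judgments = [1, 1, 0, 1, 0, 1, 1, 0]
gpt4_judgments = [1, 0, 0, 1, 0, 1, 1, 1]

print(f"Observed agreement: {observed_agreement(human_judgments, gpt4_judgments):.2%}")
print(f"Cohen's kappa: {cohens_kappa(human_judgments, gpt4_judgments):.2f}")

With the hypothetical judgments above, the sketch prints an observed agreement of 75.00% and a kappa of 0.47, illustrating how percent agreement can look substantial while the chance-corrected coefficient is more modest, a pattern consistent with the ranges reported in the Results.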

Bibliographic Details
Main Authors: Zin Tarakji, Adel Kanaan, Samer Saadi, Mohammed Firwana, Adel Kabbara Allababidi, Mohamed F. Abusalih, Rami Basmaci, Tamim I. Rajjo, Zhen Wang, M. Hassan Murad, Bashar Hasan
Format: Article
Language: English
Published: BMC 2024-11-01
Series: BMC Medical Research Methodology
Subjects: Artificial intelligence; Methodological quality assessment; Case reports and series; Murad tool; Systematic review
Online Access:https://doi.org/10.1186/s12874-024-02372-6
Collection: DOAJ
Institution: Kabale University
ISSN: 1471-2288
Record ID: doaj-art-142a5f8953c7496fb9acc0d247cb26d3
Author Affiliations: Zin Tarakji, Adel Kanaan, Samer Saadi, Mohammed Firwana, Adel Kabbara Allababidi, Mohamed F. Abusalih, Zhen Wang, M. Hassan Murad, Bashar Hasan (Evidence-based Practice Center, Kern Center for the Science of Healthcare Delivery, Mayo Clinic); Rami Basmaci, Tamim I. Rajjo (Department of Family Medicine, Mayo Clinic)