Concordance between humans and GPT-4 in appraising the methodological quality of case reports and case series using the Murad tool
Abstract

Background: Assessing the methodological quality of case reports and case series is challenging because of variability in human judgment and time constraints. We evaluated the agreement between human reviewers and GPT-4 when applying a standard methodological quality assessment tool designed for case reports and case series.

Methods: We searched Scopus for systematic reviews published in 2023–2024 that cited the appraisal tool by Murad et al. A GPT-4-based agent was developed to assess methodological quality using the tool's eight signaling questions. Observed agreement and an agreement coefficient were estimated by comparing the published judgments of human reviewers with the GPT-4 assessments.

Results: We included 797 case reports and case series. Observed agreement ranged from 41.91% to 80.93% across the eight questions (the agreement coefficient ranged from 25.39% to 79.72%). Agreement was lowest for the first signaling question, on the selection of cases. Agreement was similar for articles published in journals with an impact factor < 5 versus ≥ 5, and when systematic reviews that did not use the three causality questions were excluded. Repeating the analysis with the same prompts showed high agreement between the two GPT-4 attempts, except for the first question on the selection of cases.

Conclusions: The study demonstrates moderate agreement between GPT-4 and human reviewers in assessing the methodological quality of case reports and case series using the Murad tool. GPT-4's current performance appears promising but is unlikely to be sufficient for the rigor of a systematic review; pairing the model with a human reviewer is required.
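The record does not reproduce the authors' prompts or agent design, so the following is only a minimal illustrative sketch of how a GPT-4-based appraisal agent could be wired up with the OpenAI Python client. The paraphrased signaling questions, prompt wording, temperature setting, and JSON response contract are all assumptions made for illustration, not the published implementation.

```python
# Illustrative sketch only: a GPT-4 agent that applies the eight signaling
# questions of the Murad tool to one case report or case series.
# The question paraphrases, prompt text, and JSON contract are assumptions;
# the study's actual agent may differ.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SIGNALING_QUESTIONS = [
    "Q1 (selection): Do the reported cases represent the whole experience of the investigator or center?",
    "Q2 (ascertainment): Was the exposure adequately ascertained?",
    "Q3 (ascertainment): Was the outcome adequately ascertained?",
    "Q4 (causality): Were alternative causes that could explain the observation ruled out?",
    "Q5 (causality): Was there a challenge/rechallenge phenomenon?",
    "Q6 (causality): Was there a dose-response effect?",
    "Q7 (causality): Was follow-up long enough for outcomes to occur?",
    "Q8 (reporting): Are the cases described in enough detail for others to replicate or draw inferences?",
]

def appraise(article_text: str) -> dict:
    """Ask GPT-4 to answer each signaling question with yes / no / unclear."""
    prompt = (
        "You appraise the methodological quality of case reports and case series.\n"
        "Answer each question below with exactly one of: yes, no, unclear.\n"
        "Return only a JSON object mapping Q1..Q8 to your answers.\n\n"
        + "\n".join(SIGNALING_QUESTIONS)
        + "\n\nArticle:\n" + article_text
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # assumption for illustration; not stated in the abstract
        messages=[{"role": "user", "content": prompt}],
    )
    # A production agent would validate the model's output before parsing.
    return json.loads(response.choices[0].message.content)
```

The temperature value here is an assumption; what the abstract does report is that the analysis was repeated with the same prompts, and agreement between the two GPT-4 attempts was high for every question except the first one, on the selection of cases.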
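The abstract reports observed agreement and an agreement coefficient per question but does not name the coefficient; chance-corrected statistics such as Gwet's AC1 are a common choice for skewed rating data of this kind. The snippet below is a minimal sketch, assuming two raters (human reviewer and GPT-4) with paired categorical judgments, of how observed agreement and Gwet's AC1 could be computed. The function names and example data are hypothetical; this is not the authors' analysis code.

```python
# Minimal sketch: per-question observed agreement and Gwet's AC1 for two
# raters (human reviewer vs. GPT-4) over categorical judgments such as
# "yes" / "no" / "unclear". Assumes paired, non-missing ratings.
from collections import Counter

def observed_agreement(human: list[str], gpt: list[str]) -> float:
    """Proportion of items on which the two raters gave the same judgment."""
    assert len(human) == len(gpt) and human
    return sum(h == g for h, g in zip(human, gpt)) / len(human)

def gwet_ac1(human: list[str], gpt: list[str]) -> float:
    """Gwet's AC1 = (p_a - p_e) / (1 - p_e), with AC1 chance agreement p_e."""
    n = len(human)
    categories = sorted(set(human) | set(gpt))
    q = len(categories)
    p_a = observed_agreement(human, gpt)
    # pi_k: share of all ratings (both raters pooled) falling in category k
    counts = Counter(human) + Counter(gpt)
    pi = {k: counts[k] / (2 * n) for k in categories}
    p_e = sum(pi[k] * (1 - pi[k]) for k in categories) / (q - 1)
    return (p_a - p_e) / (1 - p_e)

# Hypothetical judgments for one signaling question across six articles
human = ["yes", "yes", "no", "unclear", "yes", "no"]
gpt4  = ["yes", "no",  "no", "yes",     "yes", "no"]
print(f"observed agreement: {observed_agreement(human, gpt4):.2%}")
print(f"Gwet's AC1: {gwet_ac1(human, gpt4):.3f}")
```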
| Main Authors: | Zin Tarakji, Adel Kanaan, Samer Saadi, Mohammed Firwana, Adel Kabbara Allababidi, Mohamed F. Abusalih, Rami Basmaci, Tamim I. Rajjo, Zhen Wang, M. Hassan Murad, Bashar Hasan |
|---|---|
| Author Affiliations: | Evidence-based Practice Center, Kern Center for the Science of Healthcare Delivery, Mayo Clinic (all authors except Rami Basmaci and Tamim I. Rajjo); Department of Family Medicine, Mayo Clinic (Rami Basmaci, Tamim I. Rajjo) |
| Format: | Article |
| Language: | English |
| Published: | BMC, 2024-11-01 |
| Series: | BMC Medical Research Methodology |
| ISSN: | 1471-2288 |
| Subjects: | Artificial intelligence; Methodological quality assessment; Case reports and series; Murad tool; Systematic review |
| Online Access: | https://doi.org/10.1186/s12874-024-02372-6 |