Concordance between humans and GPT-4 in appraising the methodological quality of case reports and case series using the Murad tool
Abstract

Background: Assessing the methodological quality of case reports and case series is challenging because of variability in human judgment and time constraints. We evaluated the agreement between human reviewers and GPT-4 when applying a standard methodological quality assessment tool designed for case reports and case series.

Methods: We searched Scopus for systematic reviews published in 2023–2024 that cited the appraisal tool by Murad et al. A GPT-4-based agent was developed to assess methodological quality using the tool's eight signaling questions. Observed agreement and an agreement coefficient were estimated by comparing the published judgments of human reviewers with the GPT-4 assessments.

Results: We included 797 case reports and case series. Observed agreement ranged from 41.91% to 80.93% across the eight questions (the agreement coefficient ranged from 25.39% to 79.72%). Agreement was lowest for the first signaling question, on the selection of cases. Agreement was similar for articles published in journals with an impact factor < 5 versus ≥ 5, and when systematic reviews that did not use the three causality questions were excluded. Repeating the analysis with the same prompts showed high agreement between the two GPT-4 attempts, except for the first question on the selection of cases.

Conclusions: The study demonstrates moderate agreement between GPT-4 and human reviewers in assessing the methodological quality of case reports and case series using the Murad tool. GPT-4's current performance appears promising but is unlikely to be sufficient for the rigor of a systematic review; pairing the model with a human reviewer is required.
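The record does not reproduce the authors' prompts or agent design, so the following is only a minimal illustrative sketch of how a GPT-4-based appraisal agent could be wired up with the OpenAI Python client. The paraphrased signaling questions, prompt wording, temperature setting, and JSON response contract are all assumptions made for illustration, not the published implementation.

```python
# Illustrative sketch only: a GPT-4 agent that applies the eight signaling
# questions of the Murad tool to one case report or case series.
# The question paraphrases, prompt text, and JSON contract are assumptions;
# the study's actual agent may differ.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SIGNALING_QUESTIONS = [
    "Q1 (selection): Do the reported cases represent the whole experience of the investigator or center?",
    "Q2 (ascertainment): Was the exposure adequately ascertained?",
    "Q3 (ascertainment): Was the outcome adequately ascertained?",
    "Q4 (causality): Were alternative causes that could explain the observation ruled out?",
    "Q5 (causality): Was there a challenge/rechallenge phenomenon?",
    "Q6 (causality): Was there a dose-response effect?",
    "Q7 (causality): Was follow-up long enough for outcomes to occur?",
    "Q8 (reporting): Are the cases described in enough detail for others to replicate or draw inferences?",
]

def appraise(article_text: str) -> dict:
    """Ask GPT-4 to answer each signaling question with yes / no / unclear."""
    prompt = (
        "You appraise the methodological quality of case reports and case series.\n"
        "Answer each question below with exactly one of: yes, no, unclear.\n"
        "Return only a JSON object mapping Q1..Q8 to your answers.\n\n"
        + "\n".join(SIGNALING_QUESTIONS)
        + "\n\nArticle:\n" + article_text
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # assumption for illustration; not stated in the abstract
        messages=[{"role": "user", "content": prompt}],
    )
    # A production agent would validate the model's output before parsing.
    return json.loads(response.choices[0].message.content)
```

The temperature value here is an assumption; what the abstract does report is that the analysis was repeated with the same prompts, and agreement between the two GPT-4 attempts was high for every question except the first one, on the selection of cases.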
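The abstract reports observed agreement and an agreement coefficient per question but does not name the coefficient; chance-corrected statistics such as Gwet's AC1 are a common choice for skewed rating data of this kind. The snippet below is a minimal sketch, assuming two raters (human reviewer and GPT-4) with paired categorical judgments, of how observed agreement and Gwet's AC1 could be computed. The function names and example data are hypothetical; this is not the authors' analysis code.

```python
# Minimal sketch: per-question observed agreement and Gwet's AC1 for two
# raters (human reviewer vs. GPT-4) over categorical judgments such as
# "yes" / "no" / "unclear". Assumes paired, non-missing ratings.
from collections import Counter

def observed_agreement(human: list[str], gpt: list[str]) -> float:
    """Proportion of items on which the two raters gave the same judgment."""
    assert len(human) == len(gpt) and human
    return sum(h == g for h, g in zip(human, gpt)) / len(human)

def gwet_ac1(human: list[str], gpt: list[str]) -> float:
    """Gwet's AC1 = (p_a - p_e) / (1 - p_e), with AC1 chance agreement p_e."""
    n = len(human)
    categories = sorted(set(human) | set(gpt))
    q = len(categories)
    p_a = observed_agreement(human, gpt)
    # pi_k: share of all ratings (both raters pooled) falling in category k
    counts = Counter(human) + Counter(gpt)
    pi = {k: counts[k] / (2 * n) for k in categories}
    p_e = sum(pi[k] * (1 - pi[k]) for k in categories) / (q - 1)
    return (p_a - p_e) / (1 - p_e)

# Hypothetical judgments for one signaling question across six articles
human = ["yes", "yes", "no", "unclear", "yes", "no"]
gpt4  = ["yes", "no",  "no", "yes",     "yes", "no"]
print(f"observed agreement: {observed_agreement(human, gpt4):.2%}")
print(f"Gwet's AC1: {gwet_ac1(human, gpt4):.3f}")
```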
| Main Authors: | Zin Tarakji, Adel Kanaan, Samer Saadi, Mohammed Firwana, Adel Kabbara Allababidi, Mohamed F. Abusalih, Rami Basmaci, Tamim I. Rajjo, Zhen Wang, M. Hassan Murad, Bashar Hasan |
|---|---|
| Author Affiliations: | Evidence-based Practice Center, Kern Center for the Science of Healthcare Delivery, Mayo Clinic (all authors except Rami Basmaci and Tamim I. Rajjo); Department of Family Medicine, Mayo Clinic (Rami Basmaci, Tamim I. Rajjo) |
| Format: | Article |
| Language: | English |
| Published: | BMC, 2024-11-01 |
| Series: | BMC Medical Research Methodology |
| ISSN: | 1471-2288 |
| Subjects: | Artificial intelligence; Methodological quality assessment; Case reports and series; Murad tool; Systematic review |
| Online Access: | https://doi.org/10.1186/s12874-024-02372-6 |