Evaluating GPT-4 as an academic support tool for clinicians: a comparative analysis of case records from the literature

Bibliographic Details
Main Authors: B.L. Fabre, M.A.F. Magalhaes Filho, P.N. Aguiar, Jr, F.M. da Costa, B. Gutierres, W.N. William, Jr, A. Del Giglio
Format: Article
Language: English
Published: Elsevier 2024-06-01
Series: ESMO Real World Data and Digital Oncology
Online Access: http://www.sciencedirect.com/science/article/pii/S2949820124000201
Description
Summary: Background: Advances in artificial intelligence (AI) and natural language processing (NLP) have led to sophisticated tools like GPT-4.0, allowing clinicians to explore their utility as health care management support tools. Our study aimed to assess the capability of GPT-4 in suggesting definitive diagnoses and appropriate work-ups to minimize unnecessary procedures.
Materials and methods: We conducted a retrospective comparative analysis, extracting clinical data from 10 cases published in the New England Journal of Medicine after 2022 and inputting these data into GPT-4 to generate diagnostic and work-up recommendations. The primary endpoint was the ability to correctly identify the final diagnosis. The secondary endpoints were the ability to list the definitive diagnosis as the first of the five most likely differential diagnoses and to determine an adequate work-up.
Results: The AI could not identify the definitive diagnosis in 2 out of 10 cases (20% inaccuracy). Among the eight cases correctly identified by the AI, the definitive diagnosis was listed at the top of the differential diagnosis list in five (63%). In terms of diagnostic tests and exams, the AI suggested unnecessary procedures in two cases, representing 40% of the cases where it failed to correctly identify the final diagnosis. Moreover, the AI could not suggest adequate treatment for seven cases (70%). Among them, the AI suggested inappropriate management for two cases, and the remaining five received incomplete or non-specific advice, such as recommending chemotherapy without specifying the best regimen.
Conclusions: Our study demonstrated GPT-4’s potential as an academic support tool, although it could not correctly identify the final diagnosis in 20% of the cases and requested unnecessary additional diagnostic tests for 40% of the patients. Future research should focus on evaluating the performance of GPT-4 using a more extensive and diverse sample, incorporating prospective assessments, and investigating its ability to improve diagnostic and therapeutic accuracy.
ISSN:2949-8201