Unveiling GPT-4V's hidden challenges behind high accuracy on USMLE questions: Observational Study

Bibliographic Details
Main Authors: Zhichao Yang, Zonghai Yao, Mahbuba Tasmin, Parth Vashisht, Won Seok Jang, Feiyun Ouyang, Beining Wang, David McManus, Dan Berlowitz, Hong Yu
Format: Article
Language: English
Published: JMIR Publications, 2025-02-01
Series: Journal of Medical Internet Research
Online Access: https://www.jmir.org/2025/1/e65146
collection DOAJ
description Background: Recent advancements in artificial intelligence, such as GPT-3.5 Turbo (OpenAI) and GPT-4, have demonstrated significant potential by achieving good scores on text-only United States Medical Licensing Examination (USMLE) examinations and effectively answering questions from physicians. However, the ability of these models to interpret medical images remains underexplored.
Objective: This study aimed to comprehensively evaluate the performance, interpretability, and limitations of GPT-3.5 Turbo, GPT-4, and its successor, GPT-4 Vision (GPT-4V), focusing specifically on GPT-4V's newly introduced image-understanding feature. By assessing the models on medical licensing examination questions that require image interpretation, we sought to highlight the strengths and weaknesses of GPT-4V in handling complex multimodal clinical information, thereby exposing hidden flaws and providing insights into its readiness for integration into clinical settings.
Methods: This cross-sectional study tested GPT-4V, GPT-4, and GPT-3.5 Turbo on a total of 227 multiple-choice questions with images from USMLE Step 1 (n=19), Step 2 Clinical Knowledge (n=14), Step 3 (n=18), the Diagnostic Radiology Qualifying Core Exam (DRQCE; n=26), and AMBOSS question banks (n=150). AMBOSS provided expert-written hints and question difficulty levels. GPT-4V's accuracy was compared with that of 2 state-of-the-art large language models, GPT-3.5 Turbo and GPT-4. Explanation quality was evaluated by having human raters choose between an explanation by GPT-4V (without a hint), an explanation by an expert, or a tie, using 3 qualitative metrics: comprehensive explanation, question information, and image interpretation. To better understand GPT-4V's explanation ability, we modified a patient case report to resemble a typical "curbside consultation" between physicians.
Results: For questions with images, GPT-4V achieved an accuracy of 84.2%, 85.7%, 88.9%, and 73.1% on USMLE Step 1, Step 2 Clinical Knowledge, and Step 3 and on the DRQCE, respectively, outperforming GPT-3.5 Turbo (42.1%, 50%, 50%, 19.2%) and GPT-4 (63.2%, 64.3%, 66.7%, 26.9%). When GPT-4V answered correctly, its explanations were nearly as good as those provided by domain experts from AMBOSS. However, incorrect answers were often accompanied by poor explanations: 18.2% (10/55) contained inaccurate text, 45.5% (25/55) contained inference errors, and 76.3% (42/55) demonstrated image misunderstandings. With human expert assistance, GPT-4V reduced errors by an average of 40% (22/55). GPT-4V's accuracy improved with hints and remained stable across difficulty levels, whereas medical student performance declined as difficulty increased. In a simulated curbside consultation scenario, GPT-4V required multiple specific prompts to interpret complex case data accurately.
Conclusions: GPT-4V achieved high accuracy on multiple-choice questions with images, highlighting its potential in medical assessments. However, significant shortcomings were observed in the quality of explanations when questions were answered incorrectly, particularly in the interpretation of images, and these could not be efficiently resolved through expert interaction. These findings reveal hidden flaws in the image interpretation capabilities of GPT-4V, underscoring the need for more comprehensive evaluations beyond multiple-choice questions before integrating GPT-4V into clinical settings.
format Article
id doaj-art-d8f17ce032d24bafaaa53494718a6691
institution Kabale University
issn 1438-8871
language English
publishDate 2025-02-01
publisher JMIR Publications
record_format Article
series Journal of Medical Internet Research
doi 10.2196/65146
title Unveiling GPT-4V's hidden challenges behind high accuracy on USMLE questions: Observational Study
url https://www.jmir.org/2025/1/e65146
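The Methods describe presenting each multiple-choice question together with its exam image to GPT-4V and recording the selected answer. Below is a minimal sketch of what such a query might look like, assuming the OpenAI Python SDK's chat completions interface with image_url content parts; the model name reflects the GPT-4V endpoint available at the time, and the question text, options, and image URL are hypothetical placeholders, not materials from the study.

```python
# Minimal sketch: send one image-based multiple-choice question to a
# vision-capable OpenAI model and read back the chosen answer.
# Assumptions: OpenAI Python SDK >= 1.x, OPENAI_API_KEY set in the environment,
# and a publicly reachable image URL. Question and options are placeholders.
from openai import OpenAI

client = OpenAI()

question = (
    "A 54-year-old man presents with chest pain. The image shows his chest x-ray. "
    "Which of the following is the most likely diagnosis?\n"
    "A. Pneumothorax\nB. Pleural effusion\nC. Pulmonary edema\nD. Lobar pneumonia"
)
image_url = "https://example.org/case123/chest_xray.png"  # hypothetical image

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # GPT-4V model name at the time of the study
    messages=[
        {
            "role": "user",
            "content": [
                # Text part: the question plus an instruction to commit to one option.
                {"type": "text",
                 "text": question + "\nAnswer with the option letter and explain your reasoning."},
                # Image part: the exam image accompanying the question.
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ],
    max_tokens=500,
)

print(response.choices[0].message.content)  # chosen option and explanation
```

Under these assumptions, accuracy could then be computed as the share of questions whose returned option letter matches the answer key, and the hint condition described in the Methods would amount to appending the expert-written hint to the text part of the prompt.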