Performance of Plug-In Augmented ChatGPT and Its Ability to Quantify Uncertainty: Simulation Study on the German Medical Board Examination

Bibliographic Details
Main Authors: Julian Madrid, Philipp Diehl, Mischa Selig, Bernd Rolauffs, Felix Patricius Hans, Hans-Jörg Busch, Tobias Scheef, Leo Benning
Format: Article
Language:English
Published: JMIR Publications 2025-03-01
Series:JMIR Medical Education
Online Access:https://mededu.jmir.org/2025/1/e58375
collection DOAJ
description Abstract
Background: GPT-4 is a large language model (LLM) trained and fine-tuned on an extensive dataset. After the public release of its predecessor in November 2022, interest in LLMs has spiked, and a multitude of potential use cases have been proposed. In parallel, however, important limitations have been outlined. In particular, current LLMs face limitations in symbolic representation and in accessing contemporary data. The recent version of GPT-4, alongside newly released plugin features, was introduced to mitigate some of these limitations.
Objective: Against this background, this work aims to investigate the performance of GPT-3.5, GPT-4, GPT-4 with plugins, and GPT-4 with plugins using pretranslated English text on the German medical board examination. Recognizing the critical importance of quantifying uncertainty for LLM applications in medicine, we furthermore assess this ability and develop a new metric, termed "confidence accuracy," to evaluate it.
Methods: We used GPT-3.5, GPT-4, GPT-4 with plugins, and GPT-4 with plugins and translation to answer questions from the German medical board examination. Additionally, we analyzed how the models justify their answers, the accuracy of their responses, and the error structure of their answers. Bootstrapping and CIs were used to evaluate the statistical significance of our findings.
Results: This study demonstrated that the available GPT models, as examples of LLMs, exceeded the minimum competency threshold established by the German medical board for medical students to obtain board certification to practice medicine. Moreover, the models could assess the uncertainty in their responses, albeit exhibiting overconfidence. Additionally, this work unraveled certain justification and reasoning structures that emerge when GPT generates answers.
Conclusions: The high performance of GPT models in answering medical questions positions them well for applications in academia and, potentially, clinical practice. Their capability to quantify uncertainty in answers suggests they could be valuable artificial intelligence agents within the clinical decision-making loop. Nevertheless, significant challenges must be addressed before artificial intelligence agents can be robustly and safely implemented in the medical domain.
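The abstract's Methods mention bootstrapping with CIs and a "confidence accuracy" metric for self-reported uncertainty, but does not spell out their definitions. The following is a minimal, generic sketch of how such an analysis might look: a percentile bootstrap CI for accuracy, and a simple calibration gap (mean stated confidence minus accuracy, where a positive value indicates overconfidence). The per-question data, and the overconfidence measure itself, are hypothetical illustrations, not the paper's actual data or metric definition.

```python
import random

def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, record each resample's mean, sort them.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-question results: 1 = correct, 0 = incorrect.
correct = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
accuracy = sum(correct) / len(correct)
ci_low, ci_high = bootstrap_ci(correct)
print(f"accuracy={accuracy:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")

# Hypothetical self-reported confidences (0-1) for the same questions.
confidence = [0.9, 0.8, 0.7, 0.95, 0.85, 0.9, 0.6, 0.9, 0.8, 0.75,
              0.9, 0.85, 0.9, 0.8, 0.95, 0.9, 0.85, 0.9, 0.7, 0.9]
# Calibration gap: mean confidence minus accuracy (> 0 means overconfident).
overconfidence = sum(confidence) / len(confidence) - accuracy
print(f"mean confidence - accuracy = {overconfidence:+.2f}")
```

Nonparametric bootstrapping of this kind makes no distributional assumption about per-question correctness, which suits the small, binary-outcome samples that board-examination question sets produce.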
id doaj-art-4ac88d7038e5469ea456bf4c3c9bb272
issn 2369-3762
doi 10.2196/58375
Author ORCID iDs:
Julian Madrid: http://orcid.org/0000-0001-5135-6873
Philipp Diehl: http://orcid.org/0000-0001-5495-8511
Mischa Selig: http://orcid.org/0009-0004-6194-8871
Bernd Rolauffs: http://orcid.org/0000-0002-3275-8196
Felix Patricius Hans: http://orcid.org/0000-0002-7164-7624
Hans-Jörg Busch: http://orcid.org/0000-0002-3897-1908
Tobias Scheef: http://orcid.org/0009-0002-9391-9640
Leo Benning: http://orcid.org/0000-0002-8429-9702