Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study

Abstract

Background: The ability of large language models (LLMs) to self-assess their confidence when answering questions in the biomedical domain remains underexplored.

Objective: This study evaluates the confidence levels of 12 LLMs across 5 medical specialties to assess the models' ability to accurately judge their own responses.

Methods: We used 1965 multiple-choice questions assessing clinical knowledge in internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide answers and to report their confidence that each answer was correct (score range 0%-100%). We calculated the correlation between each model's mean confidence score for correct answers and its overall accuracy across all questions. Confidence scores for correct and incorrect answers were also compared, using 2-sample, 2-tailed t tests, to determine the mean difference in confidence.

Results: The correlation between the mean confidence scores for correct answers and model accuracy was inverse and statistically significant.

Conclusions: Better-performing LLMs show more closely aligned overall confidence levels. However, even the most accurate models still show minimal variation in confidence between right and wrong answers, which may limit their safe use in clinical settings. Addressing overconfidence could involve refining calibration methods, performing domain-specific fine-tuning, and adding human oversight when decisions carry high risk. Further research is needed to improve these strategies before broader clinical adoption of LLMs.
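For readers who want the analysis in outline, the sketch below shows one way to compute the abstract's two statistics from a long-format results table. This is not the authors' code: the column names (model, correct, confidence), the toy data, the choice of Pearson correlation, and Welch's unequal-variance t test are all assumptions made for illustration.

```python
import pandas as pd
from scipy import stats

# Toy data: one row per (model, question). Values are invented and
# arranged to mimic the inverse pattern the abstract reports (the
# least accurate model is the most confident).
df = pd.DataFrame({
    "model": ["A"] * 6 + ["B"] * 6 + ["C"] * 6,
    "correct": [1, 1, 1, 1, 0, 0,
                1, 1, 1, 0, 0, 0,
                1, 1, 0, 0, 0, 0],
    "confidence": [88, 90, 85, 87, 80, 79,
                   92, 94, 91, 90, 89, 88,
                   97, 96, 95, 96, 94, 95],
})

rows = []
for model, g in df.groupby("model"):
    right = g.loc[g["correct"] == 1, "confidence"]
    wrong = g.loc[g["correct"] == 0, "confidence"]
    # 2-sample, 2-tailed t test on confidence for correct vs incorrect
    # answers (Welch's variant assumed; the abstract does not specify)
    t, p = stats.ttest_ind(right, wrong, equal_var=False)
    rows.append({"model": model,
                 "accuracy": g["correct"].mean(),
                 "mean_conf_correct": right.mean(),
                 "t": t, "P": p})

summary = pd.DataFrame(rows)
print(summary)

# Correlation between each model's mean confidence on correct answers
# and its overall accuracy (Pearson r assumed here)
r, p = stats.pearsonr(summary["mean_conf_correct"], summary["accuracy"])
print(f"r = {r:.2f}, P = {p:.3g}")
```

On the toy data the correlation comes out strongly negative, mirroring the inverse relationship the abstract reports; with real per-question outputs from the 12 models, the same two calls would produce the corresponding estimates.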

Bibliographic Details
Main Authors: Mahmud Omar, Reem Agbareia, Benjamin S Glicksberg, Girish N Nadkarni, Eyal Klang
Format: Article
Language: English
Published: JMIR Publications 2025-05-01
Series: JMIR Medical Informatics
ISSN: 2291-9694
Online Access: https://medinform.jmir.org/2025/1/e66917