Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study

Abstract

Background: The ability of large language models (LLMs) to self-assess their confidence when answering questions in the biomedical domain remains underexplored.

Objective: This study evaluates the confidence levels of 12 LLMs across 5 medical specialties to assess the models' ability to accurately judge their own responses.

Methods: We used 1965 multiple-choice questions assessing clinical knowledge in internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide answers and to report their confidence that each answer was correct (score range 0%-100%). We calculated the correlation between each model's mean confidence score for correct answers and its overall accuracy across all questions. Confidence scores for correct and incorrect answers were also compared, using 2-sample, 2-tailed t tests, to determine the mean difference in confidence.

Results: The correlation between the mean confidence scores for correct answers and model accuracy was inverse and statistically significant.

Conclusions: Better-performing LLMs show more closely aligned overall confidence levels. However, even the most accurate models still show minimal variation in confidence between right and wrong answers, which may limit their safe use in clinical settings. Addressing overconfidence could involve refining calibration methods, performing domain-specific fine-tuning, and adding human oversight when decisions carry high risk. Further research is needed to improve these strategies before broader clinical adoption of LLMs.
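For readers who want the analysis in outline, the sketch below shows one way to compute the abstract's two statistics from a long-format results table. This is not the authors' code: the column names (model, correct, confidence), the toy data, the choice of Pearson correlation, and Welch's unequal-variance t test are all assumptions made for illustration.

```python
import pandas as pd
from scipy import stats

# Toy data: one row per (model, question). Values are invented and
# arranged to mimic the inverse pattern the abstract reports (the
# least accurate model is the most confident).
df = pd.DataFrame({
    "model": ["A"] * 6 + ["B"] * 6 + ["C"] * 6,
    "correct": [1, 1, 1, 1, 0, 0,
                1, 1, 1, 0, 0, 0,
                1, 1, 0, 0, 0, 0],
    "confidence": [88, 90, 85, 87, 80, 79,
                   92, 94, 91, 90, 89, 88,
                   97, 96, 95, 96, 94, 95],
})

rows = []
for model, g in df.groupby("model"):
    right = g.loc[g["correct"] == 1, "confidence"]
    wrong = g.loc[g["correct"] == 0, "confidence"]
    # 2-sample, 2-tailed t test on confidence for correct vs incorrect
    # answers (Welch's variant assumed; the abstract does not specify)
    t, p = stats.ttest_ind(right, wrong, equal_var=False)
    rows.append({"model": model,
                 "accuracy": g["correct"].mean(),
                 "mean_conf_correct": right.mean(),
                 "t": t, "P": p})

summary = pd.DataFrame(rows)
print(summary)

# Correlation between each model's mean confidence on correct answers
# and its overall accuracy (Pearson r assumed here)
r, p = stats.pearsonr(summary["mean_conf_correct"], summary["accuracy"])
print(f"r = {r:.2f}, P = {p:.3g}")
```

On the toy data the correlation comes out strongly negative, mirroring the inverse relationship the abstract reports; with real per-question outputs from the 12 models, the same two calls would produce the corresponding estimates.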

Bibliographic Details
Main Authors: Mahmud Omar, Reem Agbareia, Benjamin S Glicksberg, Girish N Nadkarni, Eyal Klang
Format: Article
Language: English
Published: JMIR Publications 2025-05-01
Series: JMIR Medical Informatics
ISSN: 2291-9694
Online Access: https://medinform.jmir.org/2025/1/e66917