Evaluation of the performance of large language models in clinical decision-making in endodontics
| Main Authors: | Yağız Özbay, Deniz Erdoğan, Gözde Akbal Dinçer |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMC, 2025-04-01 |
| Series: | BMC Oral Health |
| Subjects: | Chat GPT; Chatbot; Large Language model; Endodontics; Endodontology |
| Online Access: | https://doi.org/10.1186/s12903-025-06050-x |
| _version_ | 1850042459910832128 |
|---|---|
| author | Yağız Özbay, Deniz Erdoğan, Gözde Akbal Dinçer |
| collection | DOAJ |
| description | Abstract Background Artificial intelligence (AI) chatbots excel at generating language. The growing use of generative AI large language models (LLMs) in healthcare and dentistry, including endodontics, raises questions about their accuracy. The potential of LLMs to assist clinicians’ decision-making in endodontics is therefore worth evaluating. This study aims to comparatively evaluate the answers provided by Google Bard, ChatGPT-3.5, and ChatGPT-4 to clinically relevant questions from the field of endodontics. Methods Forty open-ended questions covering different areas of endodontics were prepared and presented to Google Bard, ChatGPT-3.5, and ChatGPT-4. The validity of the questions was evaluated using the Lawshe Content Validity Index. Two experienced endodontists, blinded to the chatbots, evaluated the answers using a 3-point Likert scale. All responses deemed to contain factually wrong information were noted, and a misinformation rate for each LLM was calculated (number of answers containing wrong information / total number of questions). One-way analysis of variance and post hoc Tukey tests were used to analyze the data, with significance set at p < 0.05. Results ChatGPT-4 demonstrated the highest score and the lowest misinformation rate (p = 0.008), followed by ChatGPT-3.5 and Google Bard, respectively. The difference between ChatGPT-4 and Google Bard was statistically significant (p = 0.004). Conclusion ChatGPT-4 provided more accurate and informative answers in endodontics. However, all LLMs produced varying levels of incomplete or incorrect answers. |
| format | Article |
| id | doaj-art-fb8dc317c3c842cc9ea76fdb3c5ba764 |
| institution | DOAJ |
| issn | 1472-6831 |
| language | English |
| publishDate | 2025-04-01 |
| publisher | BMC |
| record_format | Article |
| series | BMC Oral Health |
| affiliations | Yağız Özbay: Department of Endodontics, Faculty of Dentistry, Karabük University; Deniz Erdoğan: Private Dentist; Gözde Akbal Dinçer: Department of Endodontics, Faculty of Dentistry, Okan University |
| title | Evaluation of the performance of large language models in clinical decision-making in endodontics |
| topic | Chat GPT Chatbot Large Language model Endodontics Endodontology |
| url | https://doi.org/10.1186/s12903-025-06050-x |
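The Methods in the abstract describe two quantitative steps: a misinformation rate (answers containing wrong information divided by total questions) and a one-way ANOVA over the reviewers' Likert scores. A minimal pure-Python sketch of both follows; the function names are my own and the counts in the example are illustrative, not the study's data.

```python
def misinformation_rate(wrong_answers: int, total_questions: int) -> float:
    """Fraction of answers judged factually wrong, as defined in the Methods."""
    return wrong_answers / total_questions


def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA across score groups (one group per LLM)."""
    k = len(groups)                              # number of groups
    n = sum(len(g) for g in groups)              # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their group mean
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    return (ss_between / df_between) / (ss_within / df_within)


# Illustrative only: 4 of 40 answers flagged as containing wrong information
print(misinformation_rate(4, 40))  # 0.1
```

In practice the p-values reported in the abstract (p = 0.008, p = 0.004) would come from comparing this F statistic against the F distribution and from a post hoc Tukey HSD test, e.g. via `scipy.stats` and `statsmodels`.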