Evaluation of Chatbot Responses to Text-Based Multiple-Choice Questions in Prosthodontic and Restorative Dentistry


Bibliographic Details
Main Authors: Reinhard Chun Wang Chau, Khaing Myat Thu, Ollie Yiru Yu, Richard Tai-Chiu Hsung, Denny Chon Pei Wang, Manuel Wing Ho Man, John Junwen Wang, Walter Yu Hang Lam
Format: Article
Language: English
Published: MDPI AG 2025-06-01
Series: Dentistry Journal
Online Access: https://www.mdpi.com/2304-6767/13/7/279
Description
Summary: Background/Objectives: This study aims to evaluate the response accuracy and quality of three AI chatbots (GPT-4.0, Claude-2, and Llama-2) in answering multiple-choice questions in prosthodontic and restorative dentistry. Methods: A total of 191 text-based multiple-choice questions were selected from the prosthodontic and restorative dentistry sections of the United States Integrated National Board Dental Examination (INBDE) (n = 80) and the United Kingdom Overseas Registration Examination (ORE) (n = 111). These questions were entered into the chatbots, and the AI-generated answers were compared with the official answer keys to determine their accuracy. Additionally, two dental specialists independently evaluated the rationale accompanying each chatbot response for accuracy, relevance, and comprehensiveness, categorizing the responses into four distinct ratings. Chi-square and post hoc Z-tests with Bonferroni adjustment were used to analyze the responses. Inter-rater reliability for the rationale-quality ratings between the specialists was assessed using Cohen's Kappa (κ). Results: GPT-4.0 (65.4%; n = 125/191) answered a significantly higher proportion of multiple-choice questions correctly than Claude-2 (41.9%; n = 80/191) (p < 0.017) and Llama-2 (26.2%; n = 50/191) (p < 0.017). Significant differences in answer accuracy were observed among all three chatbots (p < 0.001). In terms of rationale quality, GPT-4.0 (58.1%; n = 111/191) had a significantly higher proportion of "Correct Answer, Correct Rationale" responses than Claude-2 (37.2%; n = 71/191) (p < 0.017) and Llama-2 (24.1%; n = 46/191) (p < 0.017). Significant differences in rationale quality were observed among all three chatbots (p < 0.001). Inter-rater reliability was very high (κ = 0.83). Conclusions: GPT-4.0 demonstrated the highest accuracy and quality of reasoning in responding to prosthodontic and restorative dentistry questions. This underscores the varying efficacy of AI chatbots within specialized dental contexts.
ISSN: 2304-6767
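
The abstract describes an overall chi-square comparison of answer accuracy across the three chatbots, post hoc pairwise Z-tests judged against a Bonferroni-adjusted threshold (0.05/3 ≈ 0.017), and Cohen's Kappa for inter-rater agreement. The sketch below is a rough illustration only, not the authors' analysis code: it reproduces the accuracy comparison in Python using the counts reported above, with scipy and statsmodels as assumed library choices and illustrative variable names. The Cohen's Kappa step is omitted because it requires the specialists' raw ratings, which the abstract does not provide.

```python
# Illustrative sketch of the reported statistical workflow (not the authors' code).
# Counts are taken from the abstract; library choices and names are assumptions.
from itertools import combinations

from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

TOTAL_QUESTIONS = 191
correct = {"GPT-4.0": 125, "Claude-2": 80, "Llama-2": 50}  # correctly answered MCQs

# Overall test: 3 (chatbots) x 2 (correct / incorrect) contingency table.
table = [[c, TOTAL_QUESTIONS - c] for c in correct.values()]
chi2, p_overall, dof, _ = chi2_contingency(table)
print(f"Overall chi-square: chi2={chi2:.2f}, dof={dof}, p={p_overall:.4f}")

# Post hoc pairwise two-proportion Z-tests; each comparison is judged against
# a Bonferroni-adjusted alpha of 0.05 / 3 (~0.017), as in the abstract.
pairs = list(combinations(correct, 2))
alpha_bonferroni = 0.05 / len(pairs)
for a, b in pairs:
    z, p = proportions_ztest([correct[a], correct[b]],
                             [TOTAL_QUESTIONS, TOTAL_QUESTIONS])
    verdict = "significant" if p < alpha_bonferroni else "not significant"
    print(f"{a} vs {b}: z={z:.2f}, p={p:.4f} "
          f"({verdict} at alpha={alpha_bonferroni:.3f})")
```

Run as an ordinary script, this prints the overall chi-square result followed by the three pairwise comparisons, mirroring the structure of the reported analysis (overall difference, then GPT-4.0 vs. Claude-2, GPT-4.0 vs. Llama-2, and Claude-2 vs. Llama-2).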