Effectiveness of various general large language models in clinical consensus and case analysis in dental implantology: a comparative study

Bibliographic Details
Main Authors: Yuepeng Wu, Yukang Zhang, Mei Xu, Chen Jinzhi, Yican Xue, Yuchen Zheng
Format: Article
Language: English
Published: BMC 2025-03-01
Series: BMC Medical Informatics and Decision Making
Online Access: https://doi.org/10.1186/s12911-025-02972-2
Description
Summary:
Background: This study evaluates and compares ChatGPT-4.0, Gemini Pro 1.5 (0801), Claude 3 Opus, and Qwen 2.0 72B in answering dental implant questions. The aim is to help doctors in underserved areas choose the most suitable LLM (large language model) for their procedures, improving dental care accessibility and clinical decision-making.
Methods: Two dental implant specialists with over twenty years of clinical experience evaluated the models. Questions were categorized into simple true/false questions, complex short-answer questions, and real-life case analyses. Performance was measured using precision, recall, and Bayesian inference-based evaluation metrics.
Results: ChatGPT-4.0 exhibited the most stable and consistent performance on both simple and complex questions. Gemini Pro 1.5 (0801) performed well on simple questions but was less stable on complex tasks. Qwen 2.0 72B provided high-quality answers for specific cases but showed variability. Claude 3 Opus had the lowest performance across metrics. Statistical analysis indicated significant differences between models in diagnostic performance but not in treatment planning.
Conclusions: ChatGPT-4.0 is the most reliable model for handling medical questions, followed by Gemini Pro 1.5 (0801). Qwen 2.0 72B shows potential but lacks consistency, and Claude 3 Opus performs poorly overall. Combining multiple models is recommended for comprehensive medical decision-making.
ISSN: 1472-6947
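
The abstract states that model answers were scored with precision, recall, and Bayesian inference-based metrics, but the record does not include the implementation. The sketch below is a minimal illustration of how such scores could be computed from expert-graded counts, assuming a Beta-Binomial posterior over per-model accuracy; the Beta-Binomial choice and the names Counts, precision, recall, and beta_posterior are assumptions for illustration, not the authors' actual method or data.

```python
# Illustrative sketch only: assumes expert grading yields true/false-positive
# style counts per model; the Beta-Binomial posterior is one common Bayesian
# way to express uncertainty about accuracy, not necessarily the paper's metric.
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # answers graded correct by the experts
    fp: int  # answers the model gave that the experts graded wrong
    fn: int  # expected correct answers the model failed to provide


def precision(c: Counts) -> float:
    """Fraction of the model's answers that the experts accepted."""
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c: Counts) -> float:
    """Fraction of expected correct answers the model actually produced."""
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def beta_posterior(correct: int, total: int, a: float = 1.0, b: float = 1.0):
    """Posterior mean and Beta parameters for a model's accuracy, assuming a
    Beta(a, b) prior and a Binomial likelihood over graded questions."""
    a_post, b_post = a + correct, b + (total - correct)
    return a_post / (a_post + b_post), (a_post, b_post)


if __name__ == "__main__":
    # Hypothetical grading counts for one model on a true/false question set.
    c = Counts(tp=42, fp=5, fn=3)
    print(f"precision={precision(c):.2f}, recall={recall(c):.2f}")
    mean, params = beta_posterior(correct=42, total=50)
    print(f"posterior mean accuracy={mean:.2f}, Beta parameters={params}")
```

With counts like these, the posterior mean accuracy also comes with a full Beta distribution, which is what lets a Bayesian comparison say how likely one model is to outperform another rather than reporting a point estimate alone.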