Effectiveness of various general large language models in clinical consensus and case analysis in dental implantology: a comparative study

Abstract. Background: This study evaluates and compares ChatGPT-4.0, Gemini Pro 1.5 (0801), Claude 3 Opus, and Qwen 2.0 72B in answering dental implant questions. The aim is to help doctors in underserved areas choose the best large language models (LLMs) for their procedures, improving dental care accessibility and clinical decision-making. Methods: Two dental implant specialists with over twenty years of clinical experience evaluated the models. Questions were categorized into simple true/false questions, complex short-answer questions, and real-life case analyses. Performance was measured using precision, recall, and Bayesian inference-based evaluation metrics. Results: ChatGPT-4.0 exhibited the most stable and consistent performance on both simple and complex questions. Gemini Pro 1.5 (0801) performed well on simple questions but was less stable on complex tasks. Qwen 2.0 72B provided high-quality answers for specific cases but showed variability. Claude 3 Opus had the lowest performance across various metrics. Statistical analysis indicated significant differences between models in diagnostic performance but not in treatment planning. Conclusions: ChatGPT-4.0 is the most reliable model for handling medical questions, followed by Gemini Pro 1.5 (0801). Qwen 2.0 72B shows potential but lacks consistency, and Claude 3 Opus performs poorly overall. Combining multiple models is recommended for comprehensive medical decision-making.

Bibliographic Details
Main Authors: Yuepeng Wu, Yukang Zhang, Mei Xu, Chen Jinzhi, Yican Xue, Yuchen Zheng
Format: Article
Language: English
Published: BMC, 2025-03-01
Series: BMC Medical Informatics and Decision Making
Subjects: Large language models; Artificial intelligence; Dental implantology; Clinical decision-making; Case analysis
Online Access: https://doi.org/10.1186/s12911-025-02972-2
author Yuepeng Wu
Yukang Zhang
Mei Xu
Chen Jinzhi
Yican Xue
Yuchen Zheng
collection DOAJ
description Abstract. Background: This study evaluates and compares ChatGPT-4.0, Gemini Pro 1.5 (0801), Claude 3 Opus, and Qwen 2.0 72B in answering dental implant questions. The aim is to help doctors in underserved areas choose the best large language models (LLMs) for their procedures, improving dental care accessibility and clinical decision-making. Methods: Two dental implant specialists with over twenty years of clinical experience evaluated the models. Questions were categorized into simple true/false questions, complex short-answer questions, and real-life case analyses. Performance was measured using precision, recall, and Bayesian inference-based evaluation metrics. Results: ChatGPT-4.0 exhibited the most stable and consistent performance on both simple and complex questions. Gemini Pro 1.5 (0801) performed well on simple questions but was less stable on complex tasks. Qwen 2.0 72B provided high-quality answers for specific cases but showed variability. Claude 3 Opus had the lowest performance across various metrics. Statistical analysis indicated significant differences between models in diagnostic performance but not in treatment planning. Conclusions: ChatGPT-4.0 is the most reliable model for handling medical questions, followed by Gemini Pro 1.5 (0801). Qwen 2.0 72B shows potential but lacks consistency, and Claude 3 Opus performs poorly overall. Combining multiple models is recommended for comprehensive medical decision-making.
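The Methods portion of the abstract names precision, recall, and Bayesian inference-based evaluation metrics. A minimal sketch of such metrics, with hypothetical counts (not data from the study; the `bayesian_accuracy` Beta-Binomial estimate is one common Bayesian approach assumed here for illustration, as the record does not specify the paper's exact procedure):

```python
# Illustrative sketch of the evaluation metrics named in the abstract.
# All counts are hypothetical examples, not data from the study.

def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Standard precision and recall from answer-grading counts."""
    precision = tp / (tp + fp)  # fraction of the model's answers that were correct
    recall = tp / (tp + fn)     # fraction of the correct answers the model produced
    return precision, recall

def bayesian_accuracy(correct: int, total: int,
                      alpha: float = 1.0, beta: float = 1.0) -> float:
    """Posterior mean accuracy under a Beta(alpha, beta) prior
    (Beta-Binomial model); an assumed illustration, not the paper's method."""
    return (correct + alpha) / (total + alpha + beta)

p, r = precision_recall(tp=40, fp=10, fn=5)    # p = 0.8
acc = bayesian_accuracy(correct=45, total=50)  # shrinks 45/50 toward the prior mean
```

With a uniform Beta(1, 1) prior, the posterior mean pulls the raw accuracy slightly toward 0.5, which stabilizes comparisons when question counts per category are small.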
format Article
id doaj-art-21803b96265146b0a5afe0bbdca3de8f
institution Kabale University
issn 1472-6947
language English
publishDate 2025-03-01
publisher BMC
record_format Article
series BMC Medical Informatics and Decision Making
affiliations Yuepeng Wu: Center for Plastic & Reconstructive Surgery, Department of Stomatology, Zhejiang Provincial People’s Hospital, Affiliated People’s Hospital, Hangzhou Medical College
Yukang Zhang: Xianju Traditional Chinese Medicine Hospital
Mei Xu: Hangzhou Dental Hospital, West Branch
Chen Jinzhi: College of Oceanography, HoHai University
Yican Xue: Hangzhou Medical College
Yuchen Zheng: Center for Plastic & Reconstructive Surgery, Department of Stomatology, Zhejiang Provincial People’s Hospital, Affiliated People’s Hospital, Hangzhou Medical College
title Effectiveness of various general large language models in clinical consensus and case analysis in dental implantology: a comparative study
topic Large language models
Artificial intelligence
Dental implantology
Clinical decision-making
Case analysis
url https://doi.org/10.1186/s12911-025-02972-2