Effectiveness of various general large language models in clinical consensus and case analysis in dental implantology: a comparative study
Abstract Background This study evaluates and compares ChatGPT-4.0, Gemini Pro 1.5 (0801), Claude 3 Opus, and Qwen 2.0 72B in answering dental implant questions. The aim is to help clinicians in underserved areas choose the best-suited large language models (LLMs) for their procedures, improving access to dental care and clinical decision-making. Methods Two dental implant specialists, each with over twenty years of clinical experience, evaluated the models. Questions were categorized into simple true/false questions, complex short-answer questions, and real-life case analyses. Performance was measured using precision, recall, and Bayesian inference-based evaluation metrics. Results ChatGPT-4.0 exhibited the most stable and consistent performance on both simple and complex questions. Gemini Pro 1.5 (0801) performed well on simple questions but was less stable on complex tasks. Qwen 2.0 72B provided high-quality answers for specific cases but showed variability. Claude 3 Opus had the lowest performance across metrics. Statistical analysis indicated significant differences between models in diagnostic performance but not in treatment planning. Conclusions ChatGPT-4.0 is the most reliable model for handling these medical questions, followed by Gemini Pro 1.5 (0801). Qwen 2.0 72B shows potential but lacks consistency, and Claude 3 Opus performs poorly overall. Combining multiple models is recommended for comprehensive medical decision-making.
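The abstract names precision, recall, and Bayesian inference-based evaluation metrics without defining them. The sketch below shows how such metrics are conventionally computed; it is not the authors' actual pipeline, and the confusion counts and the Beta-Binomial prior used for the Bayesian accuracy estimate are illustrative assumptions.

```python
# Minimal sketch of the metric families named in the abstract:
# precision, recall, and a simple Bayesian (Beta-Binomial) estimate
# of per-model accuracy on true/false questions.
# All counts below are hypothetical placeholders, not study data.

def precision(tp: int, fp: int) -> float:
    """Fraction of the model's positive calls that were correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Fraction of the actual positives the model recovered."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def beta_posterior_mean(correct: int, total: int,
                        a: float = 1.0, b: float = 1.0) -> float:
    """Posterior mean of accuracy under a Beta(a, b) prior with a
    Binomial likelihood: (correct + a) / (total + a + b)."""
    return (correct + a) / (total + a + b)

if __name__ == "__main__":
    tp, fp, fn = 42, 6, 8  # hypothetical confusion counts
    print(f"precision = {precision(tp, fp):.3f}")  # 0.875
    print(f"recall    = {recall(tp, fn):.3f}")     # 0.840
    # Bayesian accuracy estimate: 42 correct out of 50 questions,
    # uniform Beta(1, 1) prior
    print(f"posterior mean accuracy = {beta_posterior_mean(42, 50):.3f}")  # 0.827
```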
| Main Authors: | Yuepeng Wu, Yukang Zhang, Mei Xu, Chen Jinzhi, Yican Xue, Yuchen Zheng |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMC, 2025-03-01 |
| Series: | BMC Medical Informatics and Decision Making |
| Subjects: | Large language models; Artificial intelligence; Dental implantology; Clinical decision-making; Case analysis |
| Online Access: | https://doi.org/10.1186/s12911-025-02972-2 |
| _version_ | 1849392383611895808 |
|---|---|
| author | Yuepeng Wu; Yukang Zhang; Mei Xu; Chen Jinzhi; Yican Xue; Yuchen Zheng |
| author_facet | Yuepeng Wu; Yukang Zhang; Mei Xu; Chen Jinzhi; Yican Xue; Yuchen Zheng |
| author_sort | Yuepeng Wu |
| collection | DOAJ |
| description | Abstract Background This study evaluates and compares ChatGPT-4.0, Gemini Pro 1.5 (0801), Claude 3 Opus, and Qwen 2.0 72B in answering dental implant questions. The aim is to help clinicians in underserved areas choose the best-suited large language models (LLMs) for their procedures, improving access to dental care and clinical decision-making. Methods Two dental implant specialists, each with over twenty years of clinical experience, evaluated the models. Questions were categorized into simple true/false questions, complex short-answer questions, and real-life case analyses. Performance was measured using precision, recall, and Bayesian inference-based evaluation metrics. Results ChatGPT-4.0 exhibited the most stable and consistent performance on both simple and complex questions. Gemini Pro 1.5 (0801) performed well on simple questions but was less stable on complex tasks. Qwen 2.0 72B provided high-quality answers for specific cases but showed variability. Claude 3 Opus had the lowest performance across metrics. Statistical analysis indicated significant differences between models in diagnostic performance but not in treatment planning. Conclusions ChatGPT-4.0 is the most reliable model for handling these medical questions, followed by Gemini Pro 1.5 (0801). Qwen 2.0 72B shows potential but lacks consistency, and Claude 3 Opus performs poorly overall. Combining multiple models is recommended for comprehensive medical decision-making. |
| format | Article |
| id | doaj-art-21803b96265146b0a5afe0bbdca3de8f |
| institution | Kabale University |
| issn | 1472-6947 |
| language | English |
| publishDate | 2025-03-01 |
| publisher | BMC |
| record_format | Article |
| series | BMC Medical Informatics and Decision Making |
| spelling | doaj-art-21803b96265146b0a5afe0bbdca3de8f; 2025-08-20T03:40:47Z; eng; BMC; BMC Medical Informatics and Decision Making; ISSN 1472-6947; 2025-03-01; 25(1):1–11; doi:10.1186/s12911-025-02972-2; Effectiveness of various general large language models in clinical consensus and case analysis in dental implantology: a comparative study; Yuepeng Wu (Center for Plastic & Reconstructive Surgery, Department of Stomatology, Zhejiang Provincial People’s Hospital, Affiliated People’s Hospital, Hangzhou Medical College); Yukang Zhang (Xianju Traditional Chinese Medicine Hospital); Mei Xu (Hangzhou Dental Hospital, West Branch); Chen Jinzhi (College of Oceanography, HoHai University); Yican Xue (Hangzhou Medical College); Yuchen Zheng (Center for Plastic & Reconstructive Surgery, Department of Stomatology, Zhejiang Provincial People’s Hospital, Affiliated People’s Hospital, Hangzhou Medical College); https://doi.org/10.1186/s12911-025-02972-2; Large language models; Artificial intelligence; Dental implantology; Clinical decision-making; Case analysis |
| spellingShingle | Yuepeng Wu; Yukang Zhang; Mei Xu; Chen Jinzhi; Yican Xue; Yuchen Zheng; Effectiveness of various general large language models in clinical consensus and case analysis in dental implantology: a comparative study; BMC Medical Informatics and Decision Making; Large language models; Artificial intelligence; Dental implantology; Clinical decision-making; Case analysis |
| title | Effectiveness of various general large language models in clinical consensus and case analysis in dental implantology: a comparative study |
| title_full | Effectiveness of various general large language models in clinical consensus and case analysis in dental implantology: a comparative study |
| title_fullStr | Effectiveness of various general large language models in clinical consensus and case analysis in dental implantology: a comparative study |
| title_full_unstemmed | Effectiveness of various general large language models in clinical consensus and case analysis in dental implantology: a comparative study |
| title_short | Effectiveness of various general large language models in clinical consensus and case analysis in dental implantology: a comparative study |
| title_sort | effectiveness of various general large language models in clinical consensus and case analysis in dental implantology a comparative study |
| topic | Large language models; Artificial intelligence; Dental implantology; Clinical decision-making; Case analysis |
| url | https://doi.org/10.1186/s12911-025-02972-2 |
| work_keys_str_mv | AT yuepengwu effectivenessofvariousgenerallargelanguagemodelsinclinicalconsensusandcaseanalysisindentalimplantologyacomparativestudy AT yukangzhang effectivenessofvariousgenerallargelanguagemodelsinclinicalconsensusandcaseanalysisindentalimplantologyacomparativestudy AT meixu effectivenessofvariousgenerallargelanguagemodelsinclinicalconsensusandcaseanalysisindentalimplantologyacomparativestudy AT chenjinzhi effectivenessofvariousgenerallargelanguagemodelsinclinicalconsensusandcaseanalysisindentalimplantologyacomparativestudy AT yicanxue effectivenessofvariousgenerallargelanguagemodelsinclinicalconsensusandcaseanalysisindentalimplantologyacomparativestudy AT yuchenzheng effectivenessofvariousgenerallargelanguagemodelsinclinicalconsensusandcaseanalysisindentalimplantologyacomparativestudy |