Artificial intelligence performance in answering multiple-choice oral pathology questions: a comparative analysis

Abstract

Background: Artificial intelligence (AI) has rapidly advanced in healthcare and dental education, significantly impacting diagnostic processes, treatment planning, and academic training. This study aims to evaluate performance differences among large language models (LLMs) by analyzing their accuracy on multiple-choice oral pathology questions.

Methods: This study evaluates the performance of eight LLMs (Gemini 1.5, Gemini 2, ChatGPT 4o, ChatGPT 4, ChatGPT o1, Copilot, Claude 3.5, Deepseek) in answering multiple-choice oral pathology questions from the Turkish Dental Specialization Examination (DUS). A total of 100 questions from 2012 to 2021 were analyzed. Questions were classified as “case-based” or “knowledge-based”, and responses were scored as “correct” or “incorrect” against the official answer keys. To prevent learning biases, no follow-up questions or feedback were provided after the LLMs’ responses.

Results: Significant performance differences were observed among the models (p < 0.001). ChatGPT o1 achieved the highest accuracy (96 correct, 4 incorrect), followed by Claude (84 correct) and by Gemini 2 and Deepseek (82 correct each); Copilot had the lowest performance (61 correct). Case-based questions showed notable performance variations (p = 0.034), with ChatGPT o1 and Claude excelling. For knowledge-based questions, ChatGPT o1 and Deepseek demonstrated the highest accuracy (p < 0.001). Post-hoc analysis revealed that ChatGPT o1 performed significantly better than most other models across both case-based and knowledge-based questions (p < 0.0031).

Conclusion: LLMs demonstrated variable proficiency on oral pathology questions, with ChatGPT o1 showing the highest accuracy. LLMs show promise as a supplementary educational tool, though further validation is required.
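
Note: the abstract reports the model comparison only at the level of p-values and does not state which software or statistical tests were used. As a rough, non-authoritative sketch, the snippet below shows how a comparison of per-model accuracy counts like the one described could be set up in Python with SciPy, using only the correct/incorrect counts the abstract reports (models whose counts are not reported are omitted). The chi-square test, the Fisher's exact pairwise tests, and the use of 0.0031 as a Bonferroni-style post-hoc threshold are illustrative assumptions, not the authors' documented method.

```python
# Minimal sketch (not the authors' actual analysis code) of a per-model
# accuracy comparison, under the assumptions stated above.
from itertools import combinations
from scipy.stats import chi2_contingency, fisher_exact

# Correct/incorrect answers out of 100 DUS questions, as reported in the abstract.
counts = {
    "ChatGPT o1": (96, 4),
    "Claude 3.5": (84, 16),
    "Gemini 2":   (82, 18),
    "Deepseek":   (82, 18),
    "Copilot":    (61, 39),
}

# Overall test: does accuracy differ across models?
table = [list(v) for v in counts.values()]
chi2, p, dof, _ = chi2_contingency(table)
print(f"overall chi-square: chi2={chi2:.2f}, dof={dof}, p={p:.4g}")

# Post-hoc pairwise comparisons against the adjusted threshold the abstract
# mentions (p < 0.0031), here treated as a Bonferroni-style cutoff.
alpha_adj = 0.0031
for (name_a, a), (name_b, b) in combinations(counts.items(), 2):
    _, p_pair = fisher_exact([list(a), list(b)])
    flag = "significant" if p_pair < alpha_adj else "n.s."
    print(f"{name_a} vs {name_b}: p={p_pair:.4g} ({flag})")
```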

Bibliographic Details
Main Authors: Birkan Eyup Yilmaz (Faculty of Dentistry, Department of Oral and Maxillofacial Surgery, Giresun University); Busra Nur Gokkurt Yilmaz (Giresun Oral and Dental Health Centre, Department of Dentomaxillofacial Radiology); Furkan Ozbey (Faculty of Dentistry, Department of Dentomaxillofacial Radiology, Afyonkarahisar Health Sciences University)
Format: Article
Language: English
Published: BMC 2025-04-01
Series: BMC Oral Health
ISSN: 1472-6831
Subjects: Artificial intelligence; Oral pathology; Large language models
Online Access: https://doi.org/10.1186/s12903-025-05926-2