Accuracy of latest large language models in answering multiple choice questions in dentistry: A comparative study.

Objectives: This study aims to evaluate the performance of the latest large language models (LLMs) in answering dental multiple choice questions (MCQs), including both text-based and image-based questions.
Material and methods: A total of 1490 MCQs from two board review books for the United States National Board Dental Examination were selected. This study evaluated six of the latest LLMs as of August 2024: ChatGPT 4.0 omni (OpenAI), Gemini Advanced 1.5 Pro (Google), Copilot Pro with GPT-4 Turbo (Microsoft), Claude 3.5 Sonnet (Anthropic), Mistral Large 2 (Mistral AI), and Llama 3.1 405B (Meta). χ² tests were performed to determine whether there were significant differences in the percentages of correct answers among the LLMs, for both the total sample and each discipline (p < 0.05).
Results: Significant differences were observed in the percentage of accurate answers among the six LLMs across text-based questions, image-based questions, and the total sample (p < 0.001). For the total sample, Copilot (85.5%), Claude (84.0%), and ChatGPT (83.8%) demonstrated the highest accuracy, followed by Mistral (78.3%) and Gemini (77.1%), with Llama (72.4%) exhibiting the lowest accuracy.
Conclusions: Newer versions of LLMs demonstrate superior performance in answering dental MCQs compared to earlier versions. Copilot, Claude, and ChatGPT achieved high accuracy on text-based questions but low accuracy on image-based questions. LLMs capable of handling image-based questions outperformed LLMs limited to text-based questions.
Clinical relevance: Dental clinicians and students should prioritize the most up-to-date LLMs to support their learning, clinical practice, and research.
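
The χ² comparison described in the methods can be illustrated with a minimal sketch. This is not the authors' code: the per-model correct/incorrect counts are reconstructed here from the reported total-sample accuracies and the 1490-question sample size, and the use of scipy.stats.chi2_contingency is an assumption about how such a test might be run.

```python
# Minimal sketch (not the authors' code) of a chi-squared test for whether
# the proportion of correct answers differs across the six LLMs.
# Counts are reconstructed from the reported total-sample accuracies,
# so they are illustrative, not the study's raw data.
from scipy.stats import chi2_contingency

TOTAL = 1490  # total number of MCQs in the study's sample

# Reported accuracy (%) on the total sample, per model.
accuracy = {
    "Copilot": 85.5,
    "Claude": 84.0,
    "ChatGPT": 83.8,
    "Mistral": 78.3,
    "Gemini": 77.1,
    "Llama": 72.4,
}

# Build a 6x2 contingency table: one row per model, [correct, incorrect].
table = [
    [round(TOTAL * pct / 100), TOTAL - round(TOTAL * pct / 100)]
    for pct in accuracy.values()
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.3g}")
# With differences this large, p is far below 0.001, matching the
# paper's reported significance across the total sample.
```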

Bibliographic Details
Main Authors: Huy Cong Nguyen, Hai Phong Dang, Thuy Linh Nguyen, Viet Hoang, Viet Anh Nguyen
Format: Article
Language: English
Published: Public Library of Science (PLoS), 2025-01-01
Series: PLoS ONE
ISSN: 1932-6203
Online Access: https://doi.org/10.1371/journal.pone.0317423