Accuracy of latest large language models in answering multiple choice questions in dentistry: A comparative study.
Objectives: This study aims to evaluate the performance of the latest large language models (LLMs) in answering dental multiple choice questions (MCQs), including both text-based and image-based questions.

Materials and methods: A total of 1490 MCQs from two board review books for the United States National Board Dental Examination were selected. This study evaluated six of the latest LLMs as of August 2024: ChatGPT 4.0 omni (OpenAI), Gemini Advanced 1.5 Pro (Google), Copilot Pro with GPT-4 Turbo (Microsoft), Claude 3.5 Sonnet (Anthropic), Mistral Large 2 (Mistral AI), and Llama 3.1 405B (Meta). χ² tests were performed to determine whether there were significant differences in the percentages of correct answers among the LLMs, both for the total sample and within each discipline (p < 0.05).

Results: Significant differences were observed in the percentage of accurate answers among the six LLMs across text-based questions, image-based questions, and the total sample (p < 0.001). For the total sample, Copilot (85.5%), Claude (84.0%), and ChatGPT (83.8%) demonstrated the highest accuracy, followed by Mistral (78.3%) and Gemini (77.1%), with Llama (72.4%) exhibiting the lowest.

Conclusions: Newer versions of LLMs demonstrate superior performance in answering dental MCQs compared with earlier versions. Copilot, Claude, and ChatGPT achieved high accuracy on text-based questions but low accuracy on image-based questions. LLMs capable of handling image-based questions outperformed LLMs limited to text-based questions.

Clinical relevance: Dental clinicians and students should prioritize the most up-to-date LLMs to support their learning, clinical practice, and research.
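The χ² comparison described in the methods can be illustrated with a minimal sketch using scipy's `chi2_contingency`. The correct/incorrect counts below are approximations reconstructed from the reported accuracies on the 1490-question sample, not the study's raw data:

```python
# Illustrative chi-squared test comparing correct-answer rates among the six LLMs.
# Counts are derived from the reported accuracies on n = 1490 MCQs (approximate).
from scipy.stats import chi2_contingency

N = 1490
accuracies = {
    "Copilot": 0.855,
    "Claude": 0.840,
    "ChatGPT": 0.838,
    "Mistral": 0.783,
    "Gemini": 0.771,
    "Llama": 0.724,
}

# Build a 6x2 contingency table: [correct, incorrect] per model.
table = []
for model, acc in accuracies.items():
    correct = round(N * acc)
    table.append([correct, N - correct])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.3g}")  # expect p < 0.001
```

A significant result here indicates only that at least one model's accuracy differs from the others; pairwise comparisons would require follow-up tests with a multiplicity correction.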
Main Authors: | Huy Cong Nguyen, Hai Phong Dang, Thuy Linh Nguyen, Viet Hoang, Viet Anh Nguyen |
---|---|
Format: | Article |
Language: | English |
Published: | Public Library of Science (PLoS), 2025-01-01 |
Series: | PLoS ONE |
ISSN: | 1932-6203 |
Online Access: | https://doi.org/10.1371/journal.pone.0317423 |