Accuracy of Large Language Models for Infective Endocarditis Prophylaxis in Dental Procedures
Main Authors: Paak Rewthamrongsris, Jirayu Burapacheep, Vorapat Trachoo, Thantrira Porntaveetus
Format: Article
Language: English
Published: Elsevier, 2025-02-01
Series: International Dental Journal
Subjects: Artificial intelligence; ChatGPT; AHA guidelines; Gemini; Claude
Online Access: http://www.sciencedirect.com/science/article/pii/S0020653924015466
author | Paak Rewthamrongsris Jirayu Burapacheep Vorapat Trachoo Thantrira Porntaveetus |
collection | DOAJ |
description | Purpose: Infective endocarditis (IE) is a serious, life-threatening condition requiring antibiotic prophylaxis for high-risk individuals undergoing invasive dental procedures. As LLMs are rapidly adopted by dental professionals for their efficiency and accessibility, assessing their accuracy in answering critical questions about antibiotic prophylaxis for IE prevention is crucial. Methods: Twenty-eight true/false questions based on the 2021 American Heart Association (AHA) guidelines for IE were posed to seven popular LLMs. Each model underwent five independent runs per question under two prompt strategies: with a pre-prompt framing the model as an experienced dentist, and without a pre-prompt. Inter-model comparisons utilised the Kruskal–Wallis test, followed by post-hoc pairwise comparisons, performed in Prism 10 software. Results: Significant differences in accuracy were observed among the LLMs. All LLMs had a narrower confidence interval with a pre-prompt, and most, with the exception of Claude 3 Opus, showed improved performance. GPT-4o had the highest accuracy (80% with a pre-prompt, 78.57% without), followed by Gemini 1.5 Pro (78.57% and 77.86%) and Claude 3 Opus (75.71% and 77.14%). Gemini 1.5 Flash had the lowest accuracy (68.57% and 63.57%). Without a pre-prompt, Gemini 1.5 Flash's accuracy was significantly lower than that of Claude 3 Opus, Gemini 1.5 Pro, and GPT-4o. With a pre-prompt, Gemini 1.5 Flash and Claude 3.5 Sonnet were significantly less accurate than Gemini 1.5 Pro and GPT-4o. None of the LLMs reached commonly used benchmark scores. All models answered inconsistently across runs, giving both correct and incorrect responses to the same question, except Claude 3.5 Sonnet with a pre-prompt, which consistently answered eight questions incorrectly across all five runs. Conclusion: LLMs like GPT-4o show promise for retrieving AHA IE-guideline information, achieving up to 80% accuracy, but complex medical questions may still pose a challenge. Pre-prompts offer a potential solution, and domain-specific training is essential for optimising LLM performance in healthcare, especially with the emergence of models with increased token limits. |
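The inter-model comparison the abstract describes (per-run accuracies from 28 true/false questions over five runs, compared with a Kruskal–Wallis test) can be sketched as follows. This is a minimal illustration only: the run-level accuracy values are hypothetical placeholders, not the study's data, and the pure-Python Kruskal–Wallis implementation stands in for the Prism 10 analysis the authors used.

```python
from collections import Counter
from itertools import chain

def average_ranks(values):
    """1-based ranks; tied values receive the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_wallis(groups):
    """Kruskal-Wallis H statistic with the standard tie correction."""
    data = list(chain.from_iterable(groups))
    n = len(data)
    ranks = average_ranks(data)
    h, idx = 0.0, 0
    for g in groups:
        r = sum(ranks[idx:idx + len(g)])  # rank sum for this group
        idx += len(g)
        h += r * r / len(g)
    h = 12.0 / (n * (n + 1)) * h - 3 * (n + 1)
    ties = sum(t ** 3 - t for t in Counter(data).values())
    return h / (1 - ties / (n ** 3 - n))

# Hypothetical per-run accuracies (fraction of 28 questions answered
# correctly, five runs per model) -- placeholders, NOT the study's data.
runs = {
    "GPT-4o":           [22/28, 23/28, 22/28, 23/28, 22/28],
    "Gemini 1.5 Pro":   [22/28, 21/28, 22/28, 23/28, 22/28],
    "Gemini 1.5 Flash": [18/28, 19/28, 17/28, 18/28, 17/28],
}
H = kruskal_wallis(list(runs.values()))
print(f"Kruskal-Wallis H = {H:.2f}")
```

The H statistic would then be compared against a chi-squared distribution (degrees of freedom = number of groups minus one), with post-hoc pairwise tests to locate which models differ, as in the study.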
format | Article |
id | doaj-art-a0f574a0c1ac41f2960b22b90114b9c4 |
institution | Kabale University |
issn | 0020-6539 |
language | English |
publishDate | 2025-02-01 |
publisher | Elsevier |
record_format | Article |
series | International Dental Journal |
spelling | International Dental Journal, ISSN 0020-6539, Elsevier, 2025-02-01, vol. 75, no. 1, pp. 206-212. Accuracy of Large Language Models for Infective Endocarditis Prophylaxis in Dental Procedures. Paak Rewthamrongsris (Department of Anatomy, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand); Jirayu Burapacheep (Stanford University, Stanford, California, USA); Vorapat Trachoo (Department of Oral and Maxillofacial Surgery, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand); Thantrira Porntaveetus (Center of Excellence in Genomics and Precision Dentistry, Clinical Research Center, Geriatric Dentistry and Special Patients Care International Program, Department of Physiology, Faculty of Dentistry, Chulalongkorn University, Bangkok, 10330, Thailand; corresponding author). Record doaj-art-a0f574a0c1ac41f2960b22b90114b9c4, indexed 2025-01-21. Online access: http://www.sciencedirect.com/science/article/pii/S0020653924015466 |
title | Accuracy of Large Language Models for Infective Endocarditis Prophylaxis in Dental Procedures |
topic | Artificial intelligence ChatGPT AHA guidelines Gemini Claude |
url | http://www.sciencedirect.com/science/article/pii/S0020653924015466 |