Accuracy of Large Language Models for Infective Endocarditis Prophylaxis in Dental Procedures

Bibliographic Details
Main Authors: Paak Rewthamrongsris, Jirayu Burapacheep, Vorapat Trachoo, Thantrira Porntaveetus
Format: Article
Language: English
Published: Elsevier, 2025-02-01
Series: International Dental Journal
Subjects: Artificial intelligence; ChatGPT; AHA guidelines; Gemini; Claude
Online Access: http://www.sciencedirect.com/science/article/pii/S0020653924015466
author Paak Rewthamrongsris
Jirayu Burapacheep
Vorapat Trachoo
Thantrira Porntaveetus
author_sort Paak Rewthamrongsris
collection DOAJ
description Purpose: Infective endocarditis (IE) is a serious, life-threatening condition requiring antibiotic prophylaxis for high-risk individuals undergoing invasive dental procedures. As large language models (LLMs) are rapidly adopted by dental professionals for their efficiency and accessibility, assessing their accuracy in answering critical questions about antibiotic prophylaxis for IE prevention is crucial. Methods: Twenty-eight true/false questions based on the 2021 American Heart Association (AHA) guidelines for IE were posed to seven popular LLMs. Each model underwent five independent runs per question under two prompt strategies: with a pre-prompt framing the model as an experienced dentist, and without a pre-prompt. Inter-model comparisons utilised the Kruskal–Wallis test, followed by post-hoc pairwise comparisons using Prism 10 software. Results: Significant differences in accuracy were observed among the LLMs. All LLMs had a narrower confidence interval with a pre-prompt, and most, with Claude 3 Opus a notable exception, showed improved performance. GPT-4o had the highest accuracy (80% with a pre-prompt, 78.57% without), followed by Gemini 1.5 Pro (78.57% and 77.86%) and Claude 3 Opus (75.71% and 77.14%). Gemini 1.5 Flash had the lowest accuracy (68.57% and 63.57%). Without a pre-prompt, Gemini 1.5 Flash was significantly less accurate than Claude 3 Opus, Gemini 1.5 Pro, and GPT-4o. With a pre-prompt, Gemini 1.5 Flash and Claude 3.5 Sonnet were significantly less accurate than Gemini 1.5 Pro and GPT-4o. None of the LLMs reached commonly used benchmark scores. All models answered inconsistently across runs, giving both correct and incorrect responses to the same questions, except Claude 3.5 Sonnet with a pre-prompt, which gave consistently incorrect answers to eight questions across all five runs. Conclusion: LLMs such as GPT-4o show promise for retrieving AHA IE guideline information, achieving up to 80% accuracy. However, complex medical questions may still pose a challenge. Pre-prompts offer a potential improvement, and domain-specific training is essential for optimising LLM performance in healthcare, especially with the emergence of models with increased token limits.
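The inter-model comparison described in the Methods rests on the Kruskal–Wallis H test over per-run accuracies. As a minimal illustrative sketch (not the authors' code; the run-level accuracy figures below are invented placeholders, not data from the study), the H statistic can be computed in plain Python:

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic with average ranks for ties
    (no tie correction, unlike scipy.stats.kruskal)."""
    vals = [v for g in groups for v in g]
    n = len(vals)
    # positions of values in ascending order
    order = sorted(range(n), key=lambda i: vals[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # extend j over a run of tied values, then assign the average rank
        j = i
        while j + 1 < n and vals[order[j + 1]] == vals[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # sum of squared rank-sums, weighted by group size
    offset, s = 0, 0.0
    for g in groups:
        r_sum = sum(ranks[offset:offset + len(g)])
        s += r_sum ** 2 / len(g)
        offset += len(g)
    return 12.0 / (n * (n + 1)) * s - 3 * (n + 1)


# Hypothetical per-run accuracy (%) over five runs for three models
gpt4o        = [78.6, 82.1, 78.6, 82.1, 78.6]
gemini_pro   = [78.6, 78.6, 75.0, 82.1, 78.6]
gemini_flash = [64.3, 67.9, 60.7, 71.4, 64.3]

print(round(kruskal_h(gpt4o, gemini_pro, gemini_flash), 2))
```

In practice one would use `scipy.stats.kruskal` (which also applies a tie correction and returns a p-value) and follow a significant result with post-hoc pairwise comparisons, as the study did in Prism 10.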
format Article
id doaj-art-a0f574a0c1ac41f2960b22b90114b9c4
institution Kabale University
issn 0020-6539
language English
publishDate 2025-02-01
publisher Elsevier
record_format Article
series International Dental Journal
spelling doaj-art-a0f574a0c1ac41f2960b22b90114b9c4
2025-01-21T04:12:46Z
eng
Elsevier
International Dental Journal, ISSN 0020-6539, 2025-02-01, Vol. 75, No. 1, pp. 206-212
Accuracy of Large Language Models for Infective Endocarditis Prophylaxis in Dental Procedures
Paak Rewthamrongsris (Department of Anatomy, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand)
Jirayu Burapacheep (Stanford University, Stanford, California, USA)
Vorapat Trachoo (Department of Oral and Maxillofacial Surgery, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand)
Thantrira Porntaveetus (Center of Excellence in Genomics and Precision Dentistry, Clinical Research Center, Geriatric Dentistry and Special Patients Care International Program, Department of Physiology, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand; Corresponding author: Center of Excellence in Genomics and Precision Dentistry, Faculty of Dentistry, Chulalongkorn University, Bangkok, 10330, Thailand)
http://www.sciencedirect.com/science/article/pii/S0020653924015466
Artificial intelligence; ChatGPT; AHA guidelines; Gemini; Claude
title Accuracy of Large Language Models for Infective Endocarditis Prophylaxis in Dental Procedures
topic Artificial intelligence
ChatGPT
AHA guidelines
Gemini
Claude
url http://www.sciencedirect.com/science/article/pii/S0020653924015466