Large language models provide discordant information compared to ophthalmology guidelines

Abstract: To evaluate the agreement of large language models (LLMs) with the Preferred Practice Patterns® (PPP) guidelines developed by the American Academy of Ophthalmology (AAO), questions based on the AAO PPP were submitted to five LLMs: GPT-o1 and GPT-4o by OpenAI, Claude 3.5 Sonnet by Anthropic, Gemini 1.5 Pro by Google, and DeepSeek-R1-Lite-Preview. Questions were classified as "open" or "confirmatory with a positive/negative ground-truth answer". Three blinded investigators rated each response as "concordant", "undetermined", or "discordant" with respect to the AAO PPP. Undetermined and discordant answers were further analyzed for their potential to harm patients, and responses citing peer-reviewed articles were recorded. In total, 147 questions were submitted to each LLM. Concordant answers numbered 135 (91.8%) for GPT-o1, 133 (90.5%) for GPT-4o, 136 (92.5%) for Claude 3.5 Sonnet, 124 (84.4%) for Gemini 1.5 Pro, and 119 (81.0%) for DeepSeek-R1-Lite-Preview (P = 0.006). The highest number of potentially harmful answers was recorded for Gemini 1.5 Pro (n = 6, 4.1%), followed by DeepSeek-R1-Lite-Preview (n = 5, 3.4%). Gemini 1.5 Pro was also the most transparent model, citing references in 86 responses (58.5%); the other LLMs cited papers in only 9.5–15.6% of their responses. LLMs can provide answers discordant with ophthalmology guidelines, potentially harming patients by delaying diagnosis or recommending suboptimal treatments.

Bibliographic Details
Main Authors: Andrea Taloni, Antonia Carmen Sangregorio, Giuseppe Alessio, Maria Angela Romeo, Giulia Coco, Linda Marie Louise Busin, Andrea Sollazzo, Vincenzo Scorcia, Giuseppe Giannaccare
Format: Article
Language: English
Published: Nature Portfolio 2025-07-01
Series: Scientific Reports
Subjects: Large language model; Artificial intelligence; Guidelines; Preferred practice patterns; American Academy of Ophthalmology (AAO)
Online Access: https://doi.org/10.1038/s41598-025-06404-z
Collection: DOAJ
ISSN: 2045-2322
Author affiliations:
Andrea Taloni: Department of Translational Medicine, University of Ferrara
Antonia Carmen Sangregorio: Department of Ophthalmology, University Magna Graecia of Catanzaro
Giuseppe Alessio: Department of Ophthalmology, University Magna Graecia of Catanzaro
Maria Angela Romeo: Department of Ophthalmology, University Magna Graecia of Catanzaro
Giulia Coco: Department of Clinical Sciences and Translational Medicine, University of Rome Tor Vergata
Linda Marie Louise Busin: Department of Ophthalmology, Ospedali Privati Forlì “Villa Igea”
Andrea Sollazzo: Department of Translational Medicine, University of Ferrara
Vincenzo Scorcia: Department of Ophthalmology, University Magna Graecia of Catanzaro
Giuseppe Giannaccare: Department of Surgical Sciences, Eye Clinic, University of Cagliari