Evaluation of performance of generative large language models for stroke care

Bibliographic Details
Main Authors: John Tayu Lee, Vincent Cheng-Sheng Li, Jia-Jyun Wu, Hsiao-Hui Chen, Sophia Sin-Yu Su, Brian Pin-Hsuan Chang, Richard Lee Lai, Chi-Hung Liu, Chung-Ting Chen, Valis Tanapima, Toby Kai-Bo Shen, Rifat Atun
Format: Article
Language:English
Published: Nature Portfolio 2025-07-01
Series:npj Digital Medicine
Online Access:https://doi.org/10.1038/s41746-025-01830-9
Description
Summary: Abstract Stroke is a leading cause of global morbidity and mortality, disproportionately impacting lower socioeconomic groups. In this study, we evaluated three generative LLMs—GPT, Claude, and Gemini—across four stages of stroke care: prevention, diagnosis, treatment, and rehabilitation. Using three prompt engineering techniques—Zero-Shot Learning (ZSL), Chain of Thought (COT), and Talking Out Your Thoughts (TOT)—we applied each to realistic stroke scenarios. Clinical experts assessed the outputs across five domains: (1) accuracy; (2) hallucinations; (3) specificity; (4) empathy; and (5) actionability, based on clinical competency benchmarks. Overall, the LLMs demonstrated suboptimal performance, with inconsistent scores across domains. Each prompt engineering method showed strengths in specific areas: TOT performed well in empathy and actionability, COT was strong in structured reasoning during diagnosis, and ZSL provided concise, accurate responses with fewer hallucinations, especially in the treatment stage. However, none consistently met high clinical standards across all stroke care stages.
ISSN:2398-6352