Evaluating GPT-4o in infectious disease diagnostics and management: A comparative study with residents and specialists on accuracy, completeness, and clinical support potential


Bibliographic Details
Main Authors: Lili Zhan, Xiumin Dang, Zhenghua Xie, Chaoying Zeng, Weixing Wu, Xiaoyu Zhang, Li Zhang, Xinjian Cai
Format: Article
Language: English
Published: SAGE Publishing 2025-07-01
Series: Digital Health
Online Access:https://doi.org/10.1177/20552076251355797
Description
Summary: Background: Artificial intelligence (AI), particularly GPT models such as GPT-4o (omnimodal), is increasingly being integrated into healthcare to provide diagnostic and treatment recommendations. However, the accuracy and clinical applicability of such AI systems remain unclear. Objective: This study aimed to evaluate the accuracy and completeness of GPT-4o in comparison with resident physicians and senior infectious disease specialists in diagnosing and managing bacterial, fungal, and viral infections. Methods: A comparative study was conducted involving GPT-4o, three resident physicians, and three senior infectious disease experts. Participants answered 75 questions (true/false, open-ended, and clinical case-based scenarios) developed according to international guidelines and clinical practice. Accuracy and completeness were assessed via blinded expert review using Likert scales. Statistical analysis included chi-square, Fisher's exact, and Kruskal–Wallis tests. Results: On true/false questions, GPT-4o showed accuracy (87.5%) comparable to specialists (90.3%) and exceeded residents (77.8%). Specialists outperformed GPT-4o in accuracy on open-ended (p = .008) and clinical case-based questions (p = .02). However, GPT-4o demonstrated significantly greater completeness than residents on open-ended (p < .0001) and clinical case-based questions (p = .01), providing more comprehensive explanations. Conclusions: GPT-4o shows promise as a tool for providing comprehensive responses in infectious disease management, although specialists still outperform it in accuracy. Continuous human oversight is recommended to mitigate potential inaccuracies in clinical decision-making. These findings suggest that while GPT-4o may be a valuable supplementary tool for medical advice, it should not replace expert consultation in complex clinical decision-making.
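The statistical comparisons described in the Methods (proportion tests on correct/incorrect answers, rank-based tests on Likert ratings) can be sketched as follows. This is a minimal illustration using SciPy with entirely made-up numbers, not the study's data; the group counts and ratings below are hypothetical assumptions chosen only to show the shape of such an analysis.

```python
# Illustrative sketch (hypothetical data, not the study's): a chi-square
# test on accuracy counts and a Kruskal-Wallis test on Likert ratings.
from scipy.stats import chi2_contingency, kruskal

# Hypothetical correct/incorrect counts on true/false items per group
# (GPT-4o, residents, specialists) -- values invented for illustration.
contingency = [
    [63, 9],   # GPT-4o: correct, incorrect
    [56, 16],  # residents
    [65, 7],   # specialists
]
chi2, p_chi, dof, _ = chi2_contingency(contingency)
print(f"chi-square: chi2={chi2:.2f}, dof={dof}, p={p_chi:.3f}")

# Hypothetical 1-5 Likert completeness ratings for open-ended answers.
gpt4o       = [5, 5, 4, 5, 4, 5, 4, 4]
residents   = [3, 2, 3, 4, 3, 2, 3, 3]
specialists = [4, 5, 4, 4, 5, 4, 4, 5]
h, p_kw = kruskal(gpt4o, residents, specialists)
print(f"Kruskal-Wallis: H={h:.2f}, p={p_kw:.4f}")
```

Kruskal–Wallis is the natural choice for ordinal Likert data because it compares rank distributions across the three rater groups without assuming normality.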
ISSN: 2055-2076