A comparison of the responses between ChatGPT and doctors in the field of cholelithiasis based on clinical practice guidelines: a cross-sectional study

Bibliographic Details
Main Authors: Tianyang Mao, Xin Zhao, Kangyi Jiang, Qingyun Xie, Manyu Yang, Ruoxuan Wang, Fengwei Gao
Format: Article
Language: English
Published: SAGE Publishing 2025-04-01
Series: Digital Health
Online Access: https://doi.org/10.1177/20552076251331804
Description
Summary: Background: With the development of the information age, an increasing number of patients are seeking information about diseases on the Internet. In the medical field, several studies have confirmed that ChatGPT has great potential for use in medical education, generating imaging reports, and even providing clinical diagnosis and treatment decisions, but its ability to answer questions related to gallstones has not yet been reported in the literature.

Objective: The aim of this study was to evaluate the consistency and accuracy of ChatGPT-generated answers to clinical questions on cholelithiasis, compared with answers provided by clinical experts.

Methods: This study designed an answering task based on clinical practice guidelines for cholelithiasis. Answers were presented in the form of keywords, and questions were categorized into general questions and professional questions. To evaluate the performance of the answers from ChatGPT and the clinical experts, the study employed a modified matching scoring system, a keyword proportion evaluation system, and the DISCERN tool.

Results: ChatGPT often provided more keywords in its responses, but its accuracy was significantly lower than that of the doctors (P < .001). In the evaluation of 33 general questions, ChatGPT and the doctors performed similarly in both the modified matching score system and the keyword proportion evaluation (P = .856 and P = .829, respectively). However, in the evaluation of 32 professional questions, the doctors consistently outperformed ChatGPT (P = .004 and P = .016). Additionally, while the DISCERN tool showed differences between general and professional questions (P = .001), both types of questions were rated at a high level overall.

Conclusions: Currently, ChatGPT performs similarly to clinical experts in answering general questions related to cholelithiasis; however, it cannot replace clinical experts in professional clinical decision-making. As ChatGPT's performance improves through deep learning, it is expected to become more useful and effective in the field of cholelithiasis. Nevertheless, in more specialized areas, careful attention and continuous evaluation will be necessary to ensure its accuracy, reliability, and safety in the medical field.
ISSN: 2055-2076