Ability of ChatGPT to Replace Doctors in Patient Education: Cross-Sectional Comparative Analysis of Inflammatory Bowel Disease


Bibliographic Details
Main Authors: Zelin Yan, Jingwen Liu, Yihong Fan, Shiyuan Lu, Dingting Xu, Yun Yang, Honggang Wang, Jie Mao, Hou-Chiang Tseng, Tao-Hsing Chang, Yan Chen
Format: Article
Language: English
Published: JMIR Publications, 2025-03-01
Series: Journal of Medical Internet Research
ISSN: 1438-8871
DOI: 10.2196/62857
Online Access: https://www.jmir.org/2025/1/e62857
Abstract

Background: Although large language models (LLMs) such as ChatGPT show promise for providing specialized information, their quality requires further evaluation. This is especially true considering that these models are trained on internet text, and the quality of health-related information available online varies widely.

Objective: The aim of this study was to evaluate the performance of ChatGPT in patient education for individuals with chronic diseases, comparing it with that of industry experts to elucidate its strengths and limitations.

Methods: This evaluation was conducted in September 2023 by analyzing the responses of ChatGPT and specialist doctors to questions posed by patients with inflammatory bowel disease (IBD). We compared their performance on subjective accuracy, empathy, completeness, and overall quality, as well as readability to support objective analysis.

Results: In a series of 1578 binary-choice assessments, ChatGPT was preferred in 48.4% (95% CI 45.9%-50.9%) of instances. There were 12 instances in which ChatGPT's responses were unanimously preferred by all evaluators, compared with 17 instances for specialist doctors. In terms of overall quality, there was no significant difference between the responses of ChatGPT (3.98, 95% CI 3.93-4.02) and those of specialist doctors (3.95, 95% CI 3.90-4.00; t(524)=0.95, P=.34), both being rated "good." Although differences in accuracy (t(521)=0.48, P=.63) and empathy (t(511)=2.19, P=.03) lacked statistical significance, completeness of textual output (t(509)=9.27, P<.001) was a distinct advantage of the LLM (ChatGPT). In the sections of the questionnaire that patients and doctors answered together (Q223-Q242), ChatGPT performed worse (t(36)=2.91, P=.006). Regarding readability, no statistical difference was found between the responses of specialist doctors (median: 7th grade; Q1: 4th grade; Q3: 8th grade) and those of ChatGPT (median: 7th grade; Q1: 7th grade; Q3: 8th grade) according to the Mann-Whitney U test (P=.09). The overall quality of ChatGPT's output was strongly correlated with the other subdimensions (empathy: r=0.842; accuracy: r=0.839; completeness: r=0.795), and the subdimensions of accuracy and completeness were also highly correlated (r=0.762).

Conclusions: ChatGPT demonstrated more stable performance across dimensions. Its health information output is more structurally sound, addressing the variability in information from individual specialist doctors. ChatGPT's performance highlights its potential as an auxiliary tool for health information, despite limitations such as artificial intelligence hallucinations. It is recommended that patients be involved in the creation and evaluation of health information to enhance its quality and relevance.
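As a plausibility check on the headline preference estimate (48.4% of 1578 binary-choice assessments, 95% CI 45.9%-50.9%), a normal-approximation confidence interval for a proportion reproduces the reported bounds. This is a sketch assuming the authors used a standard Wald-type interval; the helper name `wald_ci` is illustrative, not from the paper:

```python
from math import sqrt

def wald_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) 95% CI for a sample proportion p with n trials."""
    se = sqrt(p * (1 - p) / n)  # standard error of the proportion
    return p - z * se, p + z * se

# Preference for ChatGPT: 48.4% of 1578 assessments
lo, hi = wald_ci(0.484, 1578)
print(f"95% CI: {lo:.3f} - {hi:.3f}")  # → 95% CI: 0.459 - 0.509
```

The computed interval (0.459-0.509) matches the 45.9%-50.9% reported in the Results, consistent with a simple normal approximation rather than an exact binomial interval.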