Comparative evaluation of the accuracy and reliability of ChatGPT versions in providing information on Helicobacter pylori infection

Bibliographic Details
Main Authors: Yi Ye, En-dian Zheng, Qiao-li Lan, Le-can Wu, Hao-yue Sun, Bei-bei Xu, Ying Wang, Miao-miao Teng
Format: Article
Language: English
Published: Frontiers Media S.A., 2025-05-01
Series: Frontiers in Public Health
Subjects: artificial intelligence; Helicobacter pylori; large language model; patient education; ChatGPT
Online Access:https://www.frontiersin.org/articles/10.3389/fpubh.2025.1566982/full
Abstract

Objective: This study aimed to evaluate the accuracy and reliability of responses provided by three versions of ChatGPT (ChatGPT-3.5, ChatGPT-4, and ChatGPT-4o) to questions related to Helicobacter pylori (Hp) infection, and to explore their potential applications in the healthcare domain.

Methods: A panel of experts compiled and refined a set of 27 clinical questions related to Hp. These questions were presented to each ChatGPT version, generating three distinct sets of responses. Three gastroenterology specialists evaluated and scored the responses on a 5-point Likert scale, with an emphasis on accuracy and comprehensiveness. To assess response stability and reliability, each question was submitted three times over three consecutive days.

Results: Statistically significant differences in Likert scale scores were observed among the three ChatGPT versions (p < 0.0001). ChatGPT-4o performed best, achieving an average score of 4.46 (standard deviation 0.82). Despite its high accuracy, ChatGPT-4o exhibited relatively low repeatability. In contrast, ChatGPT-3.5 showed the highest stability, although it occasionally provided incorrect answers. In terms of readability, ChatGPT-4 achieved the highest Flesch Reading Ease score, 24.88 (standard deviation 0.44); however, no statistically significant differences in readability were observed among the versions.

Conclusion: All three versions of ChatGPT were effective in addressing Hp-related questions, with ChatGPT-4o delivering the most accurate information. These findings suggest that artificial intelligence-driven chat models hold significant potential in healthcare, facilitating improved patient awareness, self-management, and treatment compliance, and supporting physicians in making informed medical decisions by providing accurate information and personalized recommendations.
ISSN: 2296-2565
Author Affiliations: Yi Ye, En-dian Zheng, Qiao-li Lan, Le-can Wu, Hao-yue Sun, Bei-bei Xu, and Ying Wang: Department of Gastroenterology, Wenzhou People's Hospital, The Wenzhou Third Clinical Institute Affiliated to Wenzhou Medical University, Wenzhou, China; Miao-miao Teng: Postgraduate Training Base Alliance of Wenzhou Medical University, Wenzhou, China.