Comparative evaluation of the accuracy and reliability of ChatGPT versions in providing information on Helicobacter pylori infection
Objective: This study aimed to evaluate the accuracy and reliability of responses provided by three versions of ChatGPT (ChatGPT-3.5, ChatGPT-4, and ChatGPT-4o) to questions related to Helicobacter pylori (Hp) infection, and to explore their potential applications in the healthcare domain.
Saved in:
| Main Authors: | Yi Ye, En-dian Zheng, Qiao-li Lan, Le-can Wu, Hao-yue Sun, Bei-bei Xu, Ying Wang, Miao-miao Teng |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Frontiers Media S.A., 2025-05-01 |
| Series: | Frontiers in Public Health |
| Subjects: | artificial intelligence; Helicobacter pylori; large language model; patient education; ChatGPT |
| Online Access: | https://www.frontiersin.org/articles/10.3389/fpubh.2025.1566982/full |
| author | Yi Ye; En-dian Zheng; Qiao-li Lan; Le-can Wu; Hao-yue Sun; Bei-bei Xu; Ying Wang; Miao-miao Teng |
|---|---|
| description | Objective: This study aimed to evaluate the accuracy and reliability of responses provided by three versions of ChatGPT (ChatGPT-3.5, ChatGPT-4, and ChatGPT-4o) to questions related to Helicobacter pylori (Hp) infection, and to explore their potential applications in the healthcare domain. Methods: A panel of experts compiled and refined a set of 27 clinical questions related to Hp. These questions were presented to each ChatGPT version, generating three distinct sets of responses. The responses were evaluated and scored by three gastroenterology specialists using a 5-point Likert scale, with an emphasis on accuracy and comprehensiveness. To assess response stability and reliability, each question was submitted three times over three consecutive days. Results: Statistically significant differences in Likert scale scores were observed among the three ChatGPT versions (p < 0.0001). ChatGPT-4o performed best, with an average score of 4.46 (standard deviation 0.82). Despite its high accuracy, ChatGPT-4o exhibited relatively low repeatability. In contrast, ChatGPT-3.5 showed the highest stability, although it occasionally provided incorrect answers. In terms of readability, ChatGPT-4 achieved the highest Flesch Reading Ease score, 24.88 (standard deviation 0.44); however, no statistically significant differences in readability were observed among the versions. Conclusion: All three versions of ChatGPT were effective in addressing Hp-related questions, with ChatGPT-4o delivering the most accurate information. These findings suggest that artificial intelligence-driven chat models hold significant potential in healthcare, facilitating improved patient awareness, self-management, and treatment compliance, and supporting physicians in making informed medical decisions by providing accurate information and personalized recommendations. |
| format | Article |
| issn | 2296-2565 |
| language | English |
| publishDate | 2025-05-01 |
| publisher | Frontiers Media S.A. |
| series | Frontiers in Public Health |
| affiliation | Yi Ye, En-dian Zheng, Qiao-li Lan, Le-can Wu, Hao-yue Sun, Bei-bei Xu, Ying Wang: Department of Gastroenterology, Wenzhou People's Hospital, The Wenzhou Third Clinical Institute Affiliated to Wenzhou Medical University, Wenzhou, China; Miao-miao Teng: Postgraduate Training Base Alliance of Wenzhou Medical University, Wenzhou, China |
| doi | 10.3389/fpubh.2025.1566982 |
| title | Comparative evaluation of the accuracy and reliability of ChatGPT versions in providing information on Helicobacter pylori infection |
| topic | artificial intelligence; Helicobacter pylori; large language model; patient education; ChatGPT |
| url | https://www.frontiersin.org/articles/10.3389/fpubh.2025.1566982/full |