Comparative evaluation of the accuracy and reliability of ChatGPT versions in providing information on Helicobacter pylori infection
Objective: This study aimed to evaluate the accuracy and reliability of responses provided by three versions of ChatGPT (ChatGPT-3.5, ChatGPT-4, and ChatGPT-4o) to questions related to Helicobacter pylori (Hp) infection, and to explore their potential applications in the healthcare domain.
Saved in:
| Main Authors: | Yi Ye, En-dian Zheng, Qiao-li Lan, Le-can Wu, Hao-yue Sun, Bei-bei Xu, Ying Wang, Miao-miao Teng |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Frontiers Media S.A., 2025-05-01 |
| Series: | Frontiers in Public Health |
| Subjects: | artificial intelligence; Helicobacter pylori; large language model; patient education; ChatGPT |
| Online Access: | https://www.frontiersin.org/articles/10.3389/fpubh.2025.1566982/full |
| author | Yi Ye; En-dian Zheng; Qiao-li Lan; Le-can Wu; Hao-yue Sun; Bei-bei Xu; Ying Wang; Miao-miao Teng |
|---|---|
| description | Objective: This study aimed to evaluate the accuracy and reliability of responses provided by three versions of ChatGPT (ChatGPT-3.5, ChatGPT-4, and ChatGPT-4o) to questions related to Helicobacter pylori (Hp) infection, and to explore their potential applications in the healthcare domain. Methods: A panel of experts compiled and refined a set of 27 clinical questions related to Hp. These questions were presented to each ChatGPT version, generating three distinct sets of responses. The responses were evaluated and scored by three gastroenterology specialists using a 5-point Likert scale, with an emphasis on accuracy and comprehensiveness. To assess response stability and reliability, each question was submitted three times over three consecutive days. Results: Statistically significant differences in Likert scale scores were observed among the three ChatGPT versions (p < 0.0001). ChatGPT-4o performed best, with an average score of 4.46 (standard deviation 0.82). Despite its high accuracy, ChatGPT-4o exhibited relatively low repeatability. In contrast, ChatGPT-3.5 showed the highest stability, although it occasionally provided incorrect answers. In terms of readability, ChatGPT-4 achieved the highest Flesch Reading Ease score, 24.88 (standard deviation 0.44); however, no statistically significant differences in readability were observed among the versions. Conclusion: All three versions of ChatGPT were effective in addressing Hp-related questions, with ChatGPT-4o delivering the most accurate information. These findings suggest that artificial intelligence-driven chat models hold significant potential in healthcare, facilitating improved patient awareness, self-management, and treatment compliance, and supporting physicians in making informed medical decisions by providing accurate information and personalized recommendations. |
| format | Article |
| issn | 2296-2565 |
| language | English |
| publishDate | 2025-05-01 |
| publisher | Frontiers Media S.A. |
| series | Frontiers in Public Health |
| affiliation | Yi Ye, En-dian Zheng, Qiao-li Lan, Le-can Wu, Hao-yue Sun, Bei-bei Xu, Ying Wang: Department of Gastroenterology, Wenzhou People's Hospital, The Wenzhou Third Clinical Institute Affiliated to Wenzhou Medical University, Wenzhou, China; Miao-miao Teng: Postgraduate Training Base Alliance of Wenzhou Medical University, Wenzhou, China |
| doi | 10.3389/fpubh.2025.1566982 |
| title | Comparative evaluation of the accuracy and reliability of ChatGPT versions in providing information on Helicobacter pylori infection |
| topic | artificial intelligence; Helicobacter pylori; large language model; patient education; ChatGPT |
| url | https://www.frontiersin.org/articles/10.3389/fpubh.2025.1566982/full |