Assessing the accuracy and explainability of using ChatGPT to evaluate the quality of health news

Abstract
Background: With the growing prevalence of health misinformation online, there is an urgent need for tools that can reliably assist the public in evaluating the quality of health information. This study investigates the performance of GPT-3.5-Turbo, a representative and widely used large language model (LLM), in rating the quality of health news and providing explanatory justifications for its ratings.
Methods: We evaluated GPT-3.5-Turbo on 3222 health news articles from an expert-annotated dataset compiled by HealthNewsReview.org, which assesses the quality of health news against nine criteria. GPT-3.5-Turbo was prompted with standardized queries tailored to each criterion. We measured its rating performance using 95% confidence intervals for precision, recall, and F1 scores in binary classification (satisfactory/not satisfactory). Additionally, the linguistic complexity, readability, and overall quality of GPT-3.5-Turbo's explanations were assessed through both quantitative linguistic analysis and qualitative evaluation of consistency and contextual relevance.
Results: GPT-3.5-Turbo's rating performance varied across criteria, with the highest accuracy on the Cost criterion (F1 = 0.824) but low accuracy on the Benefit, Conflict, and Quality criteria (F1 < 0.5), underperforming traditional supervised machine learning models. However, its explanations were clear, with readability suited to late high school or early college reading levels, and they scored highly for consistency (average score: 2.90/3) and contextual relevance (average score: 2.73/3). These findings highlight GPT-3.5-Turbo's strength in providing understandable and contextually relevant explanations even though its rating accuracy is limited.
Conclusion: While GPT-3.5-Turbo's rating accuracy requires improvement, its strength in offering comprehensible and contextually relevant explanations presents a valuable opportunity to enhance public understanding of health news quality. Leveraging LLMs as complementary tools in health literacy initiatives could help mitigate misinformation by helping non-expert audiences interpret and assess health information.
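The record does not reproduce the study's standardized per-criterion queries, so the following is a minimal sketch of the rating setup the Methods paragraph describes, assuming the current OpenAI Python client; the criterion description and prompt wording are illustrative placeholders, not the paper's actual prompts.

```python
# Sketch: rate one article on one HealthNewsReview.org criterion.
# Assumes the OpenAI Python client (v1+) and OPENAI_API_KEY in the environment;
# the prompt wording below is hypothetical, not the paper's standardized query.
from openai import OpenAI

client = OpenAI()

CRITERION = "Cost"  # one of the nine criteria; the gloss below is illustrative

def rate_article(article_text: str) -> str:
    """Request a satisfactory/not-satisfactory rating plus a justification."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # damp run-to-run variation in ratings
        messages=[
            {"role": "system",
             "content": "You are a reviewer assessing the quality of health news."},
            {"role": "user",
             "content": (
                 f"Does the article below adequately address the '{CRITERION}' "
                 "criterion (e.g., the costs of the intervention)? Answer "
                 "'satisfactory' or 'not satisfactory', then justify your answer.\n\n"
                 f"Article:\n{article_text}"
             )},
        ],
    )
    return response.choices[0].message.content
```

Parsing the leading satisfactory/not-satisfactory token from each reply yields the binary label that is scored against the expert annotations.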

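The abstract reports 95% confidence intervals for precision, recall, and F1 but does not say how they were obtained; bootstrap resampling over articles is one standard approach, sketched here with scikit-learn as an assumption rather than the paper's documented procedure.

```python
# Sketch: per-criterion precision/recall/F1 with bootstrap 95% CIs.
# Assumes 0/1 labels where 1 = "satisfactory"; the resampling scheme is an
# assumption, since the record does not describe how the CIs were computed.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def scores_with_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    point = precision_recall_fscore_support(y_true, y_pred, average="binary")[:3]
    samples = []
    n = len(y_true)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample articles with replacement
        p, r, f, _ = precision_recall_fscore_support(
            y_true[idx], y_pred[idx], average="binary", zero_division=0)
        samples.append((p, r, f))
    lo, hi = np.quantile(samples, [alpha / 2, 1 - alpha / 2], axis=0)
    return point, lo, hi  # (precision, recall, F1) and their 95% CI bounds
```

Run once per criterion over the 3222 articles, this produces the kind of per-criterion table the abstract summarizes (e.g., F1 = 0.824 for Cost).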
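"Late high school or early college" corresponds roughly to U.S. grade levels 11-13 on standard readability indices. The record does not name the indices the authors used, so this sketch assumes the textstat package and Flesch-Kincaid scoring purely for illustration.

```python
# Sketch: readability of a generated explanation via Flesch-Kincaid.
# textstat is an assumed tooling choice; the sample explanation is invented.
import textstat

explanation = (
    "The article describes the drug's benefits in detail but never states "
    "what a course of treatment costs, so the Cost criterion is not met."
)

grade = textstat.flesch_kincaid_grade(explanation)  # U.S. school grade level
ease = textstat.flesch_reading_ease(explanation)    # 0-100; higher is easier
print(f"FK grade: {grade:.1f}, reading ease: {ease:.1f}")
```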
Bibliographic Details
Main Authors: Xiaoyu Liu, Lu He, Eman Alanazi, Echu Liu, Arianna Goss, Lionel Gumireddy
Format: Article
Language: English
Published: BMC 2025-06-01
Series: BMC Public Health
ISSN: 1471-2458
Subjects: Generative AI; Large language model; ChatGPT; GPT-3; Health misinformation; Health news
Online Access: https://doi.org/10.1186/s12889-025-23206-0
Volume/Issue: 25(1), published 2025-06-01
Author Affiliations:
Xiaoyu Liu, Echu Liu, Arianna Goss, Lionel Gumireddy: College for Public Health and Social Justice, Saint Louis University
Lu He: Zilber College of Public Health, University of Wisconsin-Milwaukee
Eman Alanazi: College of Health Sciences, Saudi Electronic University