Evaluating Accuracy and Readability of Responses to Midlife Health Questions: A Comparative Analysis of Six Large Language Model Chatbots

Bibliographic Details
Main Authors: Himel Mondal, Devendra Nath Tiu, Shaikat Mondal, Rajib Dutta, Avijit Naskar, Indrashis Podder
Format: Article
Language: English
Published: Wolters Kluwer Medknow Publications, 2025-01-01
Series: Journal of Mid-Life Health
Online Access: https://journals.lww.com/10.4103/jmh.jmh_182_24
Collection: DOAJ
Description:
Background: The use of large language model (LLM) chatbots for health-related queries is growing due to their convenience and accessibility. However, concerns persist about the accuracy and readability of the information they provide. Many individuals, including patients and healthy adults, may rely on chatbots for midlife health queries instead of consulting a doctor. In this context, we evaluated the accuracy and readability of responses from six LLM chatbots to midlife health questions for men and women.
Methods: Twenty questions on midlife health were posed to six LLM chatbots – ChatGPT, Claude, Copilot, Gemini, Meta artificial intelligence (AI), and Perplexity. Each chatbot's responses were collected and rated for accuracy, relevancy, fluency, and coherence by three independent expert physicians, and an overall score was calculated as the average of the four criteria. In addition, readability was analyzed using the Flesch-Kincaid Grade Level to determine how easily the general population could understand the information.
Results: Perplexity scored highest in fluency (4.3 ± 1.78), whereas Meta AI scored highest in coherence (4.26 ± 0.16), accuracy, and relevancy (4.35 ± 0.24). Overall, Meta AI scored the highest (4.28 ± 0.16), followed by ChatGPT (4.22 ± 0.21), whereas Copilot had the lowest score (3.72 ± 0.19) (P < 0.0001). Perplexity showed the highest readability score (41.24 ± 10.57) and the lowest grade level (11.11 ± 1.93), meaning its text is the easiest to read and requires the least education.
Conclusion: LLM chatbots can answer midlife health questions with variable capability. Meta AI was the highest-scoring chatbot for addressing men's and women's midlife health questions, whereas Perplexity offers the highest readability for accessible information. Hence, LLM chatbots can serve as educational tools for midlife health, with the chatbot selected according to its capabilities.
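For context, the readability figures in the Results correspond to the standard Flesch Reading Ease score (higher = easier) and the Flesch-Kincaid Grade Level (US school grade needed to understand the text). A minimal sketch of how such scores are computed, using a rough vowel-group syllable heuristic rather than the dictionary-based syllable counter typically used by readability tools:

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: each run of consecutive vowels counts as one syllable.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)           # average words per sentence
    spw = syllables / len(words)                # average syllables per word
    ease = 206.835 - 1.015 * wps - 84.6 * spw   # Flesch Reading Ease
    grade = 0.39 * wps + 11.8 * spw - 15.59     # Flesch-Kincaid Grade Level
    return ease, grade
```

On this scale, Perplexity's mean Reading Ease of 41.24 falls in the "difficult" band, and its grade level of 11.11 implies roughly eleventh-grade reading ability, which is why the authors describe it as the most accessible of the six chatbots rather than easy reading in absolute terms.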
Institution: Kabale University
ISSN: 0976-7800, 0976-7819
Topics: artificial intelligence, chatbots, health education, large language models, midlife health, patient education, patient queries