Evaluating Accuracy and Readability of Responses to Midlife Health Questions: A Comparative Analysis of Six Large Language Model Chatbots
Background: The use of large language model (LLM) chatbots in health-related queries is growing due to their convenience and accessibility. However, concerns about the accuracy and readability of their information persist. Many individuals, including patients and healthy adults, may rely on chatbots...
| Main Authors: | Himel Mondal, Devendra Nath Tiu, Shaikat Mondal, Rajib Dutta, Avijit Naskar, Indrashis Podder |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Wolters Kluwer Medknow Publications, 2025-01-01 |
| Series: | Journal of Mid-Life Health |
| Subjects: | artificial intelligence; chatbots; health education; large language models; midlife health; patient education; patient queries |
| Online Access: | https://journals.lww.com/10.4103/jmh.jmh_182_24 |
| _version_ | 1849323894407692288 |
|---|---|
| author | Himel Mondal; Devendra Nath Tiu; Shaikat Mondal; Rajib Dutta; Avijit Naskar; Indrashis Podder |
| author_sort | Himel Mondal |
| collection | DOAJ |
| description | Background:
The use of large language model (LLM) chatbots for health-related queries is growing due to their convenience and accessibility. However, concerns about the accuracy and readability of the information they provide persist. Many individuals, including patients and healthy adults, may rely on chatbots for midlife health queries instead of consulting a doctor. In this context, we evaluated the accuracy and readability of responses from six LLM chatbots to midlife health questions for men and women.
Methods:
Twenty questions on midlife health were posed to six LLM chatbots – ChatGPT, Claude, Copilot, Gemini, Meta artificial intelligence (AI), and Perplexity. Each chatbot's responses were collected and rated for accuracy, relevancy, fluency, and coherence by three independent expert physicians, and an overall score was calculated as the average of the four criteria. In addition, readability was analyzed using the Flesch-Kincaid Grade Level to determine how easily the information could be understood by the general population.
Results:
Perplexity scored highest for fluency (4.3 ± 1.78), while Meta AI scored highest for coherence (4.26 ± 0.16), accuracy, and relevancy (4.35 ± 0.24). Overall, Meta AI scored the highest (4.28 ± 0.16), followed by ChatGPT (4.22 ± 0.21), whereas Copilot had the lowest score (3.72 ± 0.19) (P < 0.0001). Perplexity had the highest readability score (41.24 ± 10.57) and the lowest grade level (11.11 ± 1.93), indicating that its text was the easiest to read and required the least education to understand.
Conclusion:
LLM chatbots can answer midlife-related health questions with variable capabilities. Meta AI was the highest-scoring chatbot for addressing men's and women's midlife health questions, whereas Perplexity offered the most readable, accessible text. Hence, LLM chatbots can serve as educational tools for midlife health, provided an appropriate chatbot is selected according to its capabilities. |
| format | Article |
| id | doaj-art-8c8053e19a1f42afa37392ef6582ee39 |
| institution | Kabale University |
| issn | 0976-7800; 0976-7819 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | Wolters Kluwer Medknow Publications |
| record_format | Article |
| series | Journal of Mid-Life Health |
| title | Evaluating Accuracy and Readability of Responses to Midlife Health Questions: A Comparative Analysis of Six Large Language Model Chatbots |
| topic | artificial intelligence; chatbots; health education; large language models; midlife health; patient education; patient queries |
| url | https://journals.lww.com/10.4103/jmh.jmh_182_24 |
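The readability analysis in the abstract uses the Flesch-Kincaid Grade Level. A minimal sketch of how such scores are computed from sentence, word, and syllable counts is shown below; the syllable counter is a rough vowel-group heuristic rather than the study's actual tool, and the companion Flesch Reading Ease formula is included on the assumption that the reported readability score (41.24 ± 10.57) is on that 0-100 scale.

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels;
    # every word is treated as having at least one syllable.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_scores(text: str) -> tuple[float, float]:
    """Return (reading_ease, grade_level) for a passage of English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # average words per sentence
    spw = syllables / len(words)        # average syllables per word
    reading_ease = 206.835 - 1.015 * wps - 84.6 * spw
    grade_level = 0.39 * wps + 11.8 * spw - 15.59
    return reading_ease, grade_level
```

Shorter sentences and shorter words push the reading-ease score up and the grade level down; a grade level near 11, as reported for Perplexity, corresponds roughly to high-school senior reading level.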