An Empirical Evaluation of Large Language Models on Consumer Health Questions
<b>Background:</b> Large Language Models (LLMs) have demonstrated strong performance on clinical question-answering (QA) benchmarks, yet their effectiveness in addressing real-world consumer medical queries remains underexplored. This study evaluates the capabilities and limitations of LLMs in answering consumer health questions using the MedRedQA dataset, which consists of medical questions and answers by verified experts from the AskDocs subreddit. <b>Methods:</b> Five LLMs (GPT-4o mini, Llama 3.1-70B, Mistral-123B, Mistral-7B, and Gemini-Flash) were assessed using a cross-evaluation framework: each model generated responses to the consumer queries, and every model then evaluated each model's outputs by comparing them with the expert responses. Human evaluation was used to assess the reliability of the models as evaluators. <b>Results:</b> GPT-4o mini achieved the highest alignment with expert responses according to four of the five model judges, while Mistral-7B scored the lowest according to three of the five. Overall, model responses showed low alignment with expert responses. <b>Conclusions:</b> Current small- and medium-sized LLMs struggle to provide accurate answers to consumer health questions and require significant improvement.
| Main Authors: | Moaiz Abrar, Yusuf Sermet, Ibrahim Demir |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-02-01 |
| Series: | BioMedInformatics |
| Subjects: | medical question answering; consumer medical question answering; natural language processing; artificial intelligence; large language models |
| Online Access: | https://www.mdpi.com/2673-7426/5/1/12 |
|---|---|
| author | Moaiz Abrar; Yusuf Sermet; Ibrahim Demir |
| collection | DOAJ |
| description | <b>Background:</b> Large Language Models (LLMs) have demonstrated strong performance on clinical question-answering (QA) benchmarks, yet their effectiveness in addressing real-world consumer medical queries remains underexplored. This study evaluates the capabilities and limitations of LLMs in answering consumer health questions using the MedRedQA dataset, which consists of medical questions and answers by verified experts from the AskDocs subreddit. <b>Methods:</b> Five LLMs (GPT-4o mini, Llama 3.1-70B, Mistral-123B, Mistral-7B, and Gemini-Flash) were assessed using a cross-evaluation framework: each model generated responses to the consumer queries, and every model then evaluated each model's outputs by comparing them with the expert responses. Human evaluation was used to assess the reliability of the models as evaluators. <b>Results:</b> GPT-4o mini achieved the highest alignment with expert responses according to four of the five model judges, while Mistral-7B scored the lowest according to three of the five. Overall, model responses showed low alignment with expert responses. <b>Conclusions:</b> Current small- and medium-sized LLMs struggle to provide accurate answers to consumer health questions and require significant improvement. |
| format | Article |
| id | doaj-art-02e27efa37644e079d6fb5e506ea2aa2 |
| institution | DOAJ |
| issn | 2673-7426 |
| language | English |
| publishDate | 2025-02-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | BioMedInformatics |
| doi | 10.3390/biomedinformatics5010012 |
| affiliations | Moaiz Abrar: IIHR Hydroscience and Engineering, University of Iowa, Iowa City, IA 52246, USA; Yusuf Sermet: IIHR Hydroscience and Engineering, University of Iowa, Iowa City, IA 52246, USA; Ibrahim Demir: River-Coastal Science and Engineering, Tulane University, New Orleans, LA 70118, USA |
| title | An Empirical Evaluation of Large Language Models on Consumer Health Questions |
| topic | medical question answering; consumer medical question answering; natural language processing; artificial intelligence; large language models |
| url | https://www.mdpi.com/2673-7426/5/1/12 |
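The cross-evaluation framework described in the abstract (every model answers every query, and every model then judges every model's answers against the expert reference) can be sketched as follows. This is a minimal illustration only: the `generate` and `judge` functions are hypothetical stand-ins for the actual model API calls and scoring prompts, not the authors' code.

```python
from itertools import product

MODELS = ["gpt-4o-mini", "llama-3.1-70b", "mistral-123b", "mistral-7b", "gemini-flash"]

def generate(model: str, question: str) -> str:
    # Hypothetical stand-in for an API call asking `model` to answer `question`.
    return f"{model}'s answer to: {question}"

def judge(judge_model: str, answer: str, expert_answer: str) -> int:
    # Hypothetical stand-in: the judge model rates how well `answer`
    # aligns with the expert reference, e.g. on a 1-5 scale.
    return 3

def cross_evaluate(questions, expert_answers):
    # Step 1: every model answers every consumer question.
    answers = {m: [generate(m, q) for q in questions] for m in MODELS}
    # Step 2: every model judges every model's answers against the expert responses,
    # yielding a score for each (judge, answerer) pair.
    scores = {}
    for judge_m, answer_m in product(MODELS, MODELS):
        per_q = [judge(judge_m, a, ref)
                 for a, ref in zip(answers[answer_m], expert_answers)]
        scores[(judge_m, answer_m)] = sum(per_q) / len(per_q)
    return scores

scores = cross_evaluate(
    ["Is a resting heart rate of 55 bpm normal?"],
    ["For many fit adults, yes; see a doctor if you have symptoms."],
)
# 5 judges x 5 answerers -> 25 (judge, answerer) average-alignment scores
```

With five models this produces a 5x5 score matrix, which is what lets the study report results like "highest alignment according to four of the five model judges".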