Large Language Model Synergy for Ensemble Learning in Medical Question Answering: Design and Evaluation Study

Abstract Background: Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, including medical question-answering (QA). However, individual LLMs often exhibit varying performance across different medical QA datasets. We benchm...

Bibliographic Details
Main Authors:	Han Yang, Mingchen Li, Huixue Zhou, Yongkang Xiao, Qian Fang, Shuang Zhou, Rui Zhang
Format:	Article
Language:	English
Published:	JMIR Publications 2025-07-01
Series:	Journal of Medical Internet Research
Online Access:	https://www.jmir.org/2025/1/e70080

_version_	1849317033228894208
author	Han Yang Mingchen Li Huixue Zhou Yongkang Xiao Qian Fang Shuang Zhou Rui Zhang
collection	DOAJ
description	Abstract. Background: Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, including medical question answering (QA). However, individual LLMs often exhibit varying performance across different medical QA datasets. We benchmarked individual zero-shot LLMs (GPT-4, Llama2-13B, Vicuna-13B, MedLlama-13B, and MedAlpaca-13B) to assess their baseline performance. Within the benchmark, GPT-4 achieves the best 71% on MedMCQA (medical multiple-choice question answering dataset), Vicuna-13B achieves 89.5% on PubMedQA (a dataset for biomedical question answering), and MedAlpaca-13B achieves the best 70% among all, showing the potential for better performance across different tasks and highlighting the need for strategies that can harness their collective strengths. Ensemble learning methods, which combine multiple models to improve overall accuracy and reliability, offer a promising approach to address this challenge. Objective: To develop and evaluate efficient ensemble learning approaches, we focus on improving performance across 3 medical QA datasets through 2 proposed ensemble strategies. Methods: Our study uses 3 medical QA datasets: PubMedQA (1000 manually labeled and 11,269 test, with yes, no, or maybe answered for each question), MedQA-USMLE (Medical Question Answering dataset based on the United States Medical Licensing Examination; 12,724 English board-style questions; 1272 test, 5 options), and MedMCQA (182,822 training/4183 test questions, 4-option multiple choice). We introduced the LLM-Synergy framework, consisting of 2 ensemble methods: (1) a Boosting-based Weighted Majority Vote ensemble, which refines decision-making by adaptively weighting each LLM, and (2) a Cluster-based Dynamic Model Selection ensemble, which dynamically selects the optimal LLM for each query based on question-context embeddings and clustering. Results: Both ensemble methods outperformed individual LLMs across all 3 datasets. Compared with the best individual LLM, the Boosting-based Weighted Majority Vote achieved accuracies of 35.84% on MedMCQA (+3.81%), 96.21% on PubMedQA (+0.64%), and 37.26% (tie) on MedQA-USMLE. The Cluster-based Dynamic Model Selection yielded even higher accuracies of 38.01% (+5.98%) on MedMCQA, 96.36% (+1.09%) on PubMedQA, and 38.13% (+0.87%) on MedQA-USMLE. Conclusions: The LLM-Synergy framework, using 2 ensemble methods, represents a significant advancement in leveraging LLMs for medical QA tasks. By effectively combining the strengths of diverse LLMs, this framework provides a flexible and efficient strategy adaptable to current and future challenges in biomedical informatics.
format	Article
id	doaj-art-e82549486dcb4fa584be66c184ca0aff
institution	Kabale University
issn	1438-8871
language	English
publishDate	2025-07-01
publisher	JMIR Publications
record_format	Article
series	Journal of Medical Internet Research
spelling	doaj-art-e82549486dcb4fa584be66c184ca0aff; 2025-08-20T03:51:24Z; eng; JMIR Publications; Journal of Medical Internet Research; 1438-8871; 2025-07-01; 27; e70080; e70080; 10.2196/70080; Large Language Model Synergy for Ensemble Learning in Medical Question Answering: Design and Evaluation Study; Han Yang (http://orcid.org/0009-0000-4322-6753); Mingchen Li (http://orcid.org/0009-0007-3999-0450); Huixue Zhou (http://orcid.org/0000-0002-6524-5506); Yongkang Xiao (http://orcid.org/0000-0002-8808-8371); Qian Fang (http://orcid.org/0009-0002-9439-9210); Shuang Zhou (http://orcid.org/0000-0001-5739-1637); Rui Zhang (http://orcid.org/0000-0001-8258-3585); https://www.jmir.org/2025/1/e70080
title	Large Language Model Synergy for Ensemble Learning in Medical Question Answering: Design and Evaluation Study
url	https://www.jmir.org/2025/1/e70080
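
Illustrative sketch (not part of the catalog record and not the authors' code): the abstract names two ensemble strategies without giving their details, so the Python below is only one plausible reading. It assumes an AdaBoost-style (SAMME) weighting over a fixed pool of models for the weighted majority vote, and a nearest-centroid lookup for the cluster-based selection. All function names, the toy answers, and the precomputed cluster ids and centroids are hypothetical.

```python
import math
from collections import defaultdict


def boosted_model_weights(preds, gold, n_classes, rounds=3):
    """Assumed SAMME-style weighting over a fixed pool of models.

    preds: {model_name: [answer per dev question]}, gold: [correct answers].
    """
    n = len(gold)
    sample_w = [1.0 / n] * n
    model_w = defaultdict(float)
    for _ in range(rounds):
        # Weighted error of every model under the current question weights.
        errs = {m: sum(w for w, p, g in zip(sample_w, ps, gold) if p != g)
                for m, ps in preds.items()}
        best = min(errs, key=errs.get)
        err = min(max(errs[best], 1e-9), 1 - 1e-9)
        alpha = math.log((1 - err) / err) + math.log(n_classes - 1)
        if alpha <= 0:  # no model beats chance on the reweighted questions
            break
        model_w[best] += alpha
        # Up-weight the questions the chosen model answered incorrectly.
        sample_w = [w * math.exp(alpha) if p != g else w
                    for w, p, g in zip(sample_w, preds[best], gold)]
        total = sum(sample_w)
        sample_w = [w / total for w in sample_w]
    return dict(model_w)


def weighted_vote(answers, model_w):
    """Return the answer with the largest summed model weight."""
    scores = defaultdict(float)
    for model, answer in answers.items():
        scores[answer] += model_w.get(model, 0.0)
    return max(scores, key=scores.get)


def best_model_per_cluster(cluster_ids, preds, gold):
    """For each cluster of dev questions, keep the most accurate model."""
    correct = defaultdict(lambda: defaultdict(int))
    for i, cluster in enumerate(cluster_ids):
        for model, ps in preds.items():
            correct[cluster][model] += int(ps[i] == gold[i])
    return {c: max(counts, key=counts.get) for c, counts in correct.items()}


def select_model(embedding, centroids, cluster_best):
    """Route a new question to the best model of its nearest centroid."""
    nearest = min(centroids, key=lambda c: math.dist(embedding, centroids[c]))
    return cluster_best[nearest]


# Toy dev set: two hypothetical models answering four yes/no questions.
dev_preds = {"model_a": ["yes", "no", "yes", "no"],
             "model_b": ["yes", "yes", "no", "no"]}
dev_gold = ["yes", "no", "no", "no"]

weights = boosted_model_weights(dev_preds, dev_gold, n_classes=2)
vote = weighted_vote({"model_a": "yes", "model_b": "no"}, weights)

# Cluster ids and centroids are assumed to come from a separate
# embedding and clustering step that is not shown here.
cluster_best = best_model_per_cluster([0, 0, 1, 1], dev_preds, dev_gold)
chosen = select_model([0.9, 0.8], {0: [0.0, 0.0], 1: [1.0, 1.0]}, cluster_best)
```

In this reading, the vote combines every model's answer on each question, whereas the cluster-based route consults only the model that performed best on similar development questions; which of these behaviors matches the published framework should be checked against the article itself.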

Similar Items