Large Language Model Synergy for Ensemble Learning in Medical Question Answering: Design and Evaluation Study

Abstract. Background: Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, including medical question-answering (QA). However, individual LLMs often exhibit varying performance across different medical QA datasets. We benchm...

Bibliographic Details
Main Authors: Han Yang, Mingchen Li, Huixue Zhou, Yongkang Xiao, Qian Fang, Shuang Zhou, Rui Zhang
Format: Article
Language: English
Published: JMIR Publications 2025-07-01
Series: Journal of Medical Internet Research
Online Access: https://www.jmir.org/2025/1/e70080
collection DOAJ
description Abstract
Background: Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, including medical question-answering (QA). However, individual LLMs often exhibit varying performance across different medical QA datasets. We benchmarked individual zero-shot LLMs (GPT-4, Llama2-13B, Vicuna-13B, MedLlama-13B, and MedAlpaca-13B) to assess their baseline performance. Within the benchmark, GPT-4 achieved the best result (71%) on MedMCQA (a medical multiple-choice question answering dataset), Vicuna-13B achieved 89.5% on PubMedQA (a dataset for biomedical question answering), and MedAlpaca-13B achieved the best 70% among all. No single model led on every task, which shows the potential for better performance across tasks and highlights the need for strategies that can harness the models' collective strengths. Ensemble learning methods, which combine multiple models to improve overall accuracy and reliability, offer a promising approach to this challenge.
Objective: To develop and evaluate efficient ensemble learning approaches, we focus on improving performance across 3 medical QA datasets through our 2 proposed ensemble strategies.
Methods: Our study uses 3 medical QA datasets: PubMedQA (1000 manually labeled and 11,269 test, with yes, no, or maybe answered for each question), MedQA-USMLE (Medical Question Answering dataset based on the United States Medical Licensing Examination; 12,724 English board-style questions; 1272 test, 5 options), and MedMCQA (182,822 training/4183 test questions, 4-option multiple choice). We introduced the LLM-Synergy framework, consisting of 2 ensemble methods: (1) a Boosting-based Weighted Majority Vote ensemble, which refines decision-making by adaptively weighting each LLM, and (2) a Cluster-based Dynamic Model Selection ensemble, which dynamically selects the optimal LLM for each query based on question-context embeddings and clustering.
Results: Both ensemble methods outperformed individual LLMs across all 3 datasets. Compared with the best individual LLM, the Boosting-based Weighted Majority Vote achieved accuracies of 35.84% on MedMCQA (+3.81%), 96.21% on PubMedQA (+0.64%), and 37.26% (tie) on MedQA-USMLE. The Cluster-based Dynamic Model Selection yielded even higher accuracies of 38.01% (+5.98%) on MedMCQA, 96.36% (+1.09%) on PubMedQA, and 38.13% (+0.87%) on MedQA-USMLE.
Conclusions: The LLM-Synergy framework, using 2 ensemble methods, represents a significant advancement in leveraging LLMs for medical QA tasks. By effectively combining the strengths of diverse LLMs, this framework provides a flexible and efficient strategy adaptable to current and future challenges in biomedical informatics.
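Illustrative note: the two ensemble ideas named in the abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not give the weighting rule or the clustering procedure, so the AdaBoost-style weight `log((1 - err) / err)`, the nearest-centroid lookup, the function names, and the toy data (`dev_labels`, `dev_preds`, model names `m1` to `m3`) are all hypothetical.

```python
import math
from collections import defaultdict

def model_weights(predictions, labels, eps=1e-6):
    """Assumed AdaBoost-style weight per model, from errors on a labeled dev set."""
    weights = {}
    for name, preds in predictions.items():
        err = sum(p != y for p, y in zip(preds, labels)) / len(labels)
        err = min(max(err, eps), 1 - eps)  # keep the log argument finite
        weights[name] = math.log((1 - err) / err)
    return weights

def weighted_vote(answers, weights):
    """Return the answer option with the largest summed model weight."""
    scores = defaultdict(float)
    for name, answer in answers.items():
        scores[answer] += weights[name]
    return max(scores, key=scores.get)

def nearest_cluster(embedding, centroids):
    """Index of the centroid closest to the embedding (squared Euclidean distance)."""
    dists = [sum((e - c) ** 2 for e, c in zip(embedding, centroid))
             for centroid in centroids]
    return dists.index(min(dists))

def select_model(embedding, centroids, best_model_per_cluster):
    """Pick the model assumed to perform best on the question's cluster."""
    return best_model_per_cluster[nearest_cluster(embedding, centroids)]

# Toy data (hypothetical): three models scored on four dev questions.
dev_labels = ["A", "B", "A", "C"]
dev_preds = {
    "m1": ["A", "B", "A", "A"],  # 1 error in 4
    "m2": ["A", "C", "B", "C"],  # 2 errors in 4
    "m3": ["B", "B", "C", "C"],  # 2 errors in 4
}
weights = model_weights(dev_preds, dev_labels)

# m1 outvotes m2 and m3 because their assumed weights are zero at 50% error.
choice = weighted_vote({"m1": "A", "m2": "B", "m3": "B"}, weights)

# Hypothetical 2-D question embeddings with one centroid per cluster.
picked = select_model([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]], {0: "m1", 1: "m2"})
```

In this toy case a plain majority vote would return "B", while the weighted vote returns the answer of the single more accurate model; the cluster lookup routes a question to whichever model is recorded as best for its nearest centroid.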
format Article
id doaj-art-e82549486dcb4fa584be66c184ca0aff
institution Kabale University
issn 1438-8871
language English
publishDate 2025-07-01
publisher JMIR Publications
record_format Article
series Journal of Medical Internet Research
spelling doaj-art-e82549486dcb4fa584be66c184ca0aff; 2025-08-20T03:51:24Z; eng; JMIR Publications; Journal of Medical Internet Research; 1438-8871; 2025-07-01; 27; e70080; e70080; 10.2196/70080; Large Language Model Synergy for Ensemble Learning in Medical Question Answering: Design and Evaluation Study; Han Yang (http://orcid.org/0009-0000-4322-6753); Mingchen Li (http://orcid.org/0009-0007-3999-0450); Huixue Zhou (http://orcid.org/0000-0002-6524-5506); Yongkang Xiao (http://orcid.org/0000-0002-8808-8371); Qian Fang (http://orcid.org/0009-0002-9439-9210); Shuang Zhou (http://orcid.org/0000-0001-5739-1637); Rui Zhang (http://orcid.org/0000-0001-8258-3585); https://www.jmir.org/2025/1/e70080