Large Language Model Synergy for Ensemble Learning in Medical Question Answering: Design and Evaluation Study
Abstract. Background: Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, including medical question-answering (QA). However, individual LLMs often exhibit varying performance across different medical QA datasets. We benchm...
| Main Authors: | Han Yang, Mingchen Li, Huixue Zhou, Yongkang Xiao, Qian Fang, Shuang Zhou, Rui Zhang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | JMIR Publications, 2025-07-01 |
| Series: | Journal of Medical Internet Research |
| Online Access: | https://www.jmir.org/2025/1/e70080 |
| _version_ | 1849317033228894208 |
|---|---|
| author | Han Yang; Mingchen Li; Huixue Zhou; Yongkang Xiao; Qian Fang; Shuang Zhou; Rui Zhang |
| author_facet | Han Yang; Mingchen Li; Huixue Zhou; Yongkang Xiao; Qian Fang; Shuang Zhou; Rui Zhang |
| author_sort | Han Yang |
| collection | DOAJ |
| description |
Abstract
Background: Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, including medical question answering (QA). However, individual LLMs often exhibit varying performance across different medical QA datasets. We benchmarked individual zero-shot LLMs (GPT-4, Llama2-13B, Vicuna-13B, MedLlama-13B, and MedAlpaca-13B) to assess their baseline performance. Within the benchmark, GPT-4 achieves the best accuracy of 71% on MedMCQA (a medical multiple-choice question answering dataset), Vicuna-13B achieves 89.5% on PubMedQA (a dataset for biomedical question answering), and MedAlpaca-13B achieves the best 70% among all. This variation shows the potential for better performance across different tasks and highlights the need for strategies that can harness the models' collective strengths. Ensemble learning methods, which combine multiple models to improve overall accuracy and reliability, offer a promising approach to address this challenge.
Objective: We aimed to develop and evaluate efficient ensemble learning approaches that improve performance across 3 medical QA datasets, using our 2 proposed ensemble strategies.
Methods: Our study uses 3 medical QA datasets: PubMedQA (1000 manually labeled and 11,269 test questions, each answered yes, no, or maybe), MedQA-USMLE (Medical Question Answering dataset based on the United States Medical Licensing Examination; 12,724 English board-style questions; 1272 test questions; 5 options), and MedMCQA (182,822 training and 4183 test questions; 4-option multiple choice). We introduced the LLM-Synergy framework, consisting of 2 ensemble methods: (1) a Boosting-based Weighted Majority Vote ensemble, which refines decision-making by adaptively weighting each LLM, and (2) a Cluster-based Dynamic Model Selection ensemble, which dynamically selects the optimal LLM for each query based on question-context embeddings and clustering.
Results: Both ensemble methods outperformed individual LLMs across all 3 datasets. Compared with the best individual LLM, the Boosting-based Weighted Majority Vote achieved accuracies of 35.84% on MedMCQA (+3.81%), 96.21% on PubMedQA (+0.64%), and 37.26% (tie) on MedQA-USMLE. The Cluster-based Dynamic Model Selection yielded even higher accuracies of 38.01% (+5.98%) on MedMCQA, 96.36% (+1.09%) on PubMedQA, and 38.13% (+0.87%) on MedQA-USMLE.
Conclusions: The LLM-Synergy framework, using 2 ensemble methods, represents a significant advancement in leveraging LLMs for medical QA tasks. By effectively combining the strengths of diverse LLMs, this framework provides a flexible and efficient strategy adaptable to current and future challenges in biomedical informatics. |
| format | Article |
| id | doaj-art-e82549486dcb4fa584be66c184ca0aff |
| institution | Kabale University |
| issn | 1438-8871 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | JMIR Publications |
| record_format | Article |
| series | Journal of Medical Internet Research |
| spelling | doaj-art-e82549486dcb4fa584be66c184ca0aff; 2025-08-20T03:51:24Z; eng; JMIR Publications; Journal of Medical Internet Research; 1438-8871; 2025-07-01; 27; e70080; e70080; 10.2196/70080; Large Language Model Synergy for Ensemble Learning in Medical Question Answering: Design and Evaluation Study; Han Yang (http://orcid.org/0009-0000-4322-6753); Mingchen Li (http://orcid.org/0009-0007-3999-0450); Huixue Zhou (http://orcid.org/0000-0002-6524-5506); Yongkang Xiao (http://orcid.org/0000-0002-8808-8371); Qian Fang (http://orcid.org/0009-0002-9439-9210); Shuang Zhou (http://orcid.org/0000-0001-5739-1637); Rui Zhang (http://orcid.org/0000-0001-8258-3585); https://www.jmir.org/2025/1/e70080 |
| spellingShingle | Han Yang Mingchen Li Huixue Zhou Yongkang Xiao Qian Fang Shuang Zhou Rui Zhang Large Language Model Synergy for Ensemble Learning in Medical Question Answering: Design and Evaluation Study Journal of Medical Internet Research |
| title | Large Language Model Synergy for Ensemble Learning in Medical Question Answering: Design and Evaluation Study |
| title_sort | large language model synergy for ensemble learning in medical question answering design and evaluation study |
| url | https://www.jmir.org/2025/1/e70080 |
| work_keys_str_mv | AT hanyang largelanguagemodelsynergyforensemblelearninginmedicalquestionansweringdesignandevaluationstudy AT mingchenli largelanguagemodelsynergyforensemblelearninginmedicalquestionansweringdesignandevaluationstudy AT huixuezhou largelanguagemodelsynergyforensemblelearninginmedicalquestionansweringdesignandevaluationstudy AT yongkangxiao largelanguagemodelsynergyforensemblelearninginmedicalquestionansweringdesignandevaluationstudy AT qianfang largelanguagemodelsynergyforensemblelearninginmedicalquestionansweringdesignandevaluationstudy AT shuangzhou largelanguagemodelsynergyforensemblelearninginmedicalquestionansweringdesignandevaluationstudy AT ruizhang largelanguagemodelsynergyforensemblelearninginmedicalquestionansweringdesignandevaluationstudy |
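
The two ensemble methods named in the Methods can be illustrated with a minimal Python sketch. The abstract does not give the weighting rule, the embedding model, or the clustering algorithm, so everything here is an assumption made for illustration: the weights use the multiclass AdaBoost (SAMME) formula, the cluster centroids are taken as precomputed (for example by k-means), and all function and parameter names are hypothetical rather than taken from the LLM-Synergy code.

```python
import math
from collections import defaultdict


def model_weights(val_preds, val_labels, n_options, eps=1e-6):
    """Assumed weighting rule: SAMME-style weight from validation error."""
    weights = {}
    for name, preds in val_preds.items():
        errors = sum(p != y for p, y in zip(preds, val_labels))
        # Clamp the error rate so the logarithm stays finite.
        err = min(max(errors / len(val_labels), eps), 1 - eps)
        weights[name] = math.log((1 - err) / err) + math.log(n_options - 1)
    return weights


def weighted_vote(answers, weights):
    """Return the answer option with the largest summed model weight."""
    totals = defaultdict(float)
    for name, answer in answers.items():
        totals[answer] += weights[name]
    return max(totals, key=totals.get)


def nearest(centroids, embedding):
    """Index of the centroid closest to the embedding (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(centroids[i], embedding)),
    )


def best_model_per_cluster(embeddings, correct, centroids):
    """Pick, for each cluster, the model with the most correct validation answers.

    correct[name][i] is True when model `name` answered question i correctly.
    """
    hits = defaultdict(lambda: defaultdict(int))
    for i, emb in enumerate(embeddings):
        cluster = nearest(centroids, emb)
        for name, flags in correct.items():
            hits[cluster][name] += flags[i]
    return {c: max(h, key=h.get) for c, h in hits.items()}


def select_model(embedding, centroids, best):
    """Route a new question to the model chosen for its nearest cluster."""
    return best[nearest(centroids, embedding)]
```

In this sketch the weighted vote combines all models on every question, while the cluster-based selection consults a single model per question, chosen from how each model performed on similar validation questions. The published method may differ on each of these points.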