Medical LLMs: Fine-Tuning vs. Retrieval-Augmented Generation

Large language models (LLMs) are trained on huge datasets, which allow them to answer questions from various domains. However, their expertise is confined to the data that they were trained on. In order to specialize LLMs in niche domains like healthcare, various training methods can be employed. Tw...

Bibliographic Details
Main Authors:	Bhagyajit Pingua, Adyakanta Sahoo, Meenakshi Kandpal, Deepak Murmu, Jyotirmayee Rautaray, Rabindra Kumar Barik, Manob Jyoti Saikia
Format:	Article
Language:	English
Published:	MDPI AG 2025-06-01
Series:	Bioengineering
Subjects:	large language models; healthcare; medical; retrieval-augmented generation; fine-tuning
Online Access:	https://www.mdpi.com/2306-5354/12/7/687

_version_	1849419202317778944
author	Bhagyajit Pingua, Adyakanta Sahoo, Meenakshi Kandpal, Deepak Murmu, Jyotirmayee Rautaray, Rabindra Kumar Barik, Manob Jyoti Saikia
author_sort	Bhagyajit Pingua
collection	DOAJ
description	Large language models (LLMs) are trained on huge datasets, which allow them to answer questions from various domains. However, their expertise is confined to the data that they were trained on. To specialize LLMs in niche domains like healthcare, various training methods can be employed. Two commonly known approaches are retrieval-augmented generation and model fine-tuning. Five models (Llama-3.1-8B, Gemma-2-9B, Mistral-7B-Instruct, Qwen2.5-7B, and Phi-3.5-Mini-Instruct) were fine-tuned on healthcare data. These models were trained using three distinct approaches: retrieval-augmented generation (RAG) alone, fine-tuning (FT) alone, and a combination of both (FT+RAG) on the MedQuAD dataset, which covers a wide range of medical topics including disease symptoms, treatments, medications, and more. Our findings revealed that RAG and FT+RAG consistently outperformed FT alone across most models, particularly LLAMA and PHI. LLAMA and PHI excelled across multiple metrics, with LLAMA showing superior overall performance and PHI demonstrating strong RAG/FT+RAG capabilities. QWEN lagged behind in most metrics, while GEMMA and MISTRAL showed mixed results.
format	Article
id	doaj-art-dea863ab480f42748158c8fca92a1178
institution	Kabale University
issn	2306-5354
language	English
publishDate	2025-06-01
publisher	MDPI AG
record_format	Article
series	Bioengineering
spelling	doaj-art-dea863ab480f42748158c8fca92a1178 | 2025-08-20T03:32:12Z | eng | MDPI AG | Bioengineering | 2306-5354 | 2025-06-01 | 12 | 7 | 687 | 10.3390/bioengineering12070687 | Medical LLMs: Fine-Tuning vs. Retrieval-Augmented Generation | Bhagyajit Pingua (0), Adyakanta Sahoo (1), Meenakshi Kandpal (2), Deepak Murmu (3), Jyotirmayee Rautaray (4), Rabindra Kumar Barik (5), Manob Jyoti Saikia (6) | (0) Biomedical Sensors & Systems Lab, University of Memphis, Memphis, TN 38152, USA; (1) Biomedical Sensors & Systems Lab, University of Memphis, Memphis, TN 38152, USA; (2) Department of Computer Science and Engineering, Odisha University of Technology and Research, Bhubaneswar 751003, India; (3) Department of Computer Science and Engineering, Odisha University of Technology and Research, Bhubaneswar 751003, India; (4) Department of Computer Science and Engineering, Odisha University of Technology and Research, Bhubaneswar 751003, India; (5) School of Computer Applications, KIIT Deemed to be University, Bhubaneswar 751003, India; (6) Biomedical Sensors & Systems Lab, University of Memphis, Memphis, TN 38152, USA | Large language models (LLMs) are trained on huge datasets, which allow them to answer questions from various domains. However, their expertise is confined to the data that they were trained on. To specialize LLMs in niche domains like healthcare, various training methods can be employed. Two commonly known approaches are retrieval-augmented generation and model fine-tuning. Five models (Llama-3.1-8B, Gemma-2-9B, Mistral-7B-Instruct, Qwen2.5-7B, and Phi-3.5-Mini-Instruct) were fine-tuned on healthcare data. These models were trained using three distinct approaches: retrieval-augmented generation (RAG) alone, fine-tuning (FT) alone, and a combination of both (FT+RAG) on the MedQuAD dataset, which covers a wide range of medical topics including disease symptoms, treatments, medications, and more. Our findings revealed that RAG and FT+RAG consistently outperformed FT alone across most models, particularly LLAMA and PHI. LLAMA and PHI excelled across multiple metrics, with LLAMA showing superior overall performance and PHI demonstrating strong RAG/FT+RAG capabilities. QWEN lagged behind in most metrics, while GEMMA and MISTRAL showed mixed results. | https://www.mdpi.com/2306-5354/12/7/687 | large language models; healthcare; medical; retrieval-augmented generation; fine-tuning
title	Medical LLMs: Fine-Tuning vs. Retrieval-Augmented Generation
topic	large language models; healthcare; medical; retrieval-augmented generation; fine-tuning
url	https://www.mdpi.com/2306-5354/12/7/687

Similar Items