Medical LLMs: Fine-Tuning vs. Retrieval-Augmented Generation

Bibliographic Details
Main Authors: Bhagyajit Pingua, Adyakanta Sahoo, Meenakshi Kandpal, Deepak Murmu, Jyotirmayee Rautaray, Rabindra Kumar Barik, Manob Jyoti Saikia
Format: Article
Language: English
Published: MDPI AG 2025-06-01
Series: Bioengineering
Online Access: https://www.mdpi.com/2306-5354/12/7/687
Summary: Large language models (LLMs) are trained on vast datasets, which allows them to answer questions from many domains. However, their expertise is confined to the data they were trained on. To specialize LLMs in niche domains such as healthcare, various adaptation methods can be employed; two common approaches are retrieval-augmented generation (RAG) and model fine-tuning (FT). Five models (Llama-3.1-8B, Gemma-2-9B, Mistral-7B-Instruct, Qwen2.5-7B, and Phi-3.5-Mini-Instruct) were evaluated under three distinct configurations: RAG alone, FT alone, and a combination of both (FT+RAG), on the MedQuAD dataset, which covers a wide range of medical topics including disease symptoms, treatments, and medications. Our findings revealed that RAG and FT+RAG consistently outperformed FT alone across most models, particularly Llama and Phi. Llama and Phi excelled across multiple metrics, with Llama showing superior overall performance and Phi demonstrating strong RAG and FT+RAG capabilities. Qwen lagged behind on most metrics, while Gemma and Mistral showed mixed results.
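To make the RAG configuration concrete: at inference time, relevant question-answer pairs are retrieved from the medical corpus and prepended to the model's prompt so its answer is grounded in that context. The following is a minimal sketch of that retrieval-and-prompting step, assuming a tiny in-memory corpus of MedQuAD-style QA pairs and a simple word-overlap scorer; the corpus entries and scoring scheme are illustrative assumptions, not the article's actual pipeline.

```python
import re

def tokenize(text):
    """Lowercase and extract word tokens, dropping punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))

# Hypothetical MedQuAD-style (question, answer) pairs for illustration.
CORPUS = [
    ("What are the symptoms of glaucoma?",
     "Glaucoma often has no early symptoms; vision loss appears gradually."),
    ("How is hypertension treated?",
     "Hypertension is managed with lifestyle changes and antihypertensive drugs."),
]

def retrieve(query, corpus, k=1):
    """Rank corpus entries by word overlap with the query; return the top k."""
    q = tokenize(query)
    return sorted(
        corpus,
        key=lambda qa: len(q & tokenize(qa[0] + " " + qa[1])),
        reverse=True,
    )[:k]

def build_prompt(query, corpus):
    """Prepend retrieved answers as context for the LLM to ground its reply."""
    context = "\n".join(answer for _, answer in retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

prompt = build_prompt("What symptoms does glaucoma cause?", CORPUS)
```

In a production pipeline the word-overlap scorer would typically be replaced by dense embedding similarity over the full MedQuAD corpus, but the prompt-assembly structure stays the same.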
ISSN: 2306-5354