Development and evaluation of an agentic LLM based RAG framework for evidence-based patient education
| Main Authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMJ Publishing Group, 2025-07-01 |
| Series: | BMJ Health & Care Informatics |
| Online Access: | https://informatics.bmj.com/content/32/1/e101570.full |
| Summary: | Objectives To develop and evaluate an agentic retrieval augmented generation (ARAG) framework using open-source large language models (LLMs) for generating evidence-based Arabic patient education materials (PEMs), and to assess the LLMs' capabilities as validation agents tasked with blocking harmful content. Methods We selected 12 LLMs and applied four experimental setups (base, base+prompt engineering, ARAG, and ARAG+prompt engineering). PEM generation quality was assessed via a two-stage evaluation (automated LLM, then expert review) using five metrics (accuracy, readability, comprehensiveness, appropriateness and safety) against ground truth. Validation agent (VA) performance was evaluated separately using a harmful/safe PEM dataset, measuring blocking accuracy. Results ARAG-enabled setups yielded the best generation performance for 10/12 LLMs. Arabic-focused models occupied the top nine ranks. The expert evaluation ranking mirrored the automated ranking. AceGPT-v2-32B with ARAG and prompt engineering (setup 4) was confirmed as the highest-performing configuration. VA accuracy correlated strongly with model size; only models ≥27B parameters achieved >0.80 accuracy. Fanar-7B performed well in generation but poorly as a VA. Discussion Arabic-centred models demonstrated advantages for the Arabic PEM generation task. ARAG enhanced generation quality, although context limits affected large-context models. The validation task highlighted model size as critical for reliable performance. Conclusion ARAG noticeably improves Arabic PEM generation, particularly with Arabic-centred models such as AceGPT-v2-32B. Larger models appear necessary for reliable harmful-content validation. Automated evaluation showed potential for ranking systems, aligning with expert judgement for top performers. |
| ISSN: | 2632-1009 |