A Pipeline for Automating Emergency Medicine Documentation Using LLMs with Retrieval-Augmented Text Generation

Bibliographic Details
Main Authors: Denis Moser, Matthias Bender, Murat Sariyar
Format: Article
Language: English
Published: Taylor & Francis Group 2025-12-01
Series: Applied Artificial Intelligence
Online Access: https://www.tandfonline.com/doi/10.1080/08839514.2025.2519169
Description
Summary: Accurate and efficient documentation of patient information is vital in emergency healthcare settings. Traditional manual documentation methods are often time-consuming and prone to errors, potentially affecting patient outcomes. Large Language Models (LLMs) offer a promising solution to enhance medical communication systems; however, their clinical deployment, particularly in non-English languages such as German, presents challenges related to content accuracy, clinical relevance, and data privacy. This study addresses these challenges by developing and evaluating an automated pipeline for emergency medical documentation in German. The research objectives include (1) generating synthetic dialogues with known ground truth data to create controlled datasets for evaluating NLP performance and (2) designing an innovative pipeline to retrieve essential clinical information from these dialogues. A subset of 100 anonymized patient records from the MIMIC-IV-ED dataset was selected, ensuring diversity in demographics, chief complaints, and conditions. A Retrieval-Augmented Generation (RAG) system extracted key nominal and numerical features using chunking, embedding, and dynamic prompts. Evaluation metrics included precision, recall, F1-score, and sentiment analysis. Initial results demonstrated high extraction accuracy, particularly in medication data (F1-scores: 86.21%–100%), though performance declined in nuanced clinical language, requiring further refinement for real-world emergency settings.
ISSN: 0883-9514
1087-6545
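
Illustration: the summary reports precision, recall, and F1-score for extracted features such as medications. The sketch below shows one common way these metrics can be computed when the extracted items and the ground truth are compared as sets. It is not the authors' code; the function name `prf1` and the medication names are hypothetical, and the article may define matching differently (for example, with fuzzy or per-field matching).

```python
def prf1(predicted, gold):
    """Set-based precision, recall, and F1 for extracted items.

    Assumption: an extracted item counts as correct only on exact match
    with a ground-truth item; the article may use a different rule.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: ground truth from a patient record vs. items
# extracted from the corresponding synthetic dialogue.
gold_meds = {"ibuprofen", "metoprolol", "aspirin"}
pred_meds = {"ibuprofen", "metoprolol", "paracetamol"}

p, r, f = prf1(pred_meds, gold_meds)
print(f"precision={p:.2%} recall={r:.2%} f1={f:.2%}")
```

Here two of the three extracted items match the ground truth, so precision, recall, and F1 all come out at two thirds.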