Evaluating large language models for drafting emergency department encounter summaries.

Large language models (LLMs) possess a range of capabilities that may be applied to the clinical domain, including text summarization. As ambient artificial intelligence scribes and other LLM-based tools begin to be deployed within healthcare settings, rigorous evaluations of the accuracy of these technologies are urgently needed. In this cross-sectional study of 100 randomly sampled adult Emergency Department (ED) visits from 2012 to 2023 at the University of California, San Francisco ED, we sought to investigate the performance of GPT-4 and GPT-3.5-turbo in generating ED encounter summaries and to evaluate the prevalence and type of errors in each section of the encounter summary across three evaluation criteria: 1) Inaccuracy of LLM-summarized information; 2) Hallucination of information; 3) Omission of relevant clinical information. In total, 33% of summaries generated by GPT-4 and 10% of those generated by GPT-3.5-turbo were entirely error-free across all evaluated domains. Summaries generated by GPT-4 were mostly accurate, with inaccuracies found in only 10% of cases; however, 42% of the summaries exhibited hallucinations and 47% omitted clinically relevant information. Inaccuracies and hallucinations were most commonly found in the Plan sections of LLM-generated summaries, while clinical omissions were concentrated in text describing patients' Physical Examination findings or History of Presenting Complaint. The potential harmfulness score across errors was low, with a mean score of 0.57 (SD 1.11) out of 7 and only three errors scoring 4 ('Potential for permanent harm') or greater. In summary, we found that LLMs could generate accurate encounter summaries but were liable to hallucination and omission of clinically relevant information. Individual errors on average had a low potential for harm. A comprehensive understanding of the location and type of errors found in LLM-generated clinical text is important to facilitate clinician review of such content and prevent patient harm.
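
To make the evaluation scheme concrete, the sketch below tallies error prevalence and mean potential harmfulness over a set of reviewed summaries, mirroring the three criteria and the 0-7 harm scale described in the abstract. It is a minimal illustration under stated assumptions: the data structures and function names (Error, SummaryReview, audit) are hypothetical and are not the authors' published pipeline.

# Minimal sketch of the error audit described above: per-summary error
# records are tallied into an error-free rate, per-criterion prevalence,
# and a mean potential-harmfulness score on the 0-7 scale. All class and
# function names here are hypothetical illustrations, not the study's code.
from dataclasses import dataclass
from statistics import mean, stdev

ERROR_TYPES = ("inaccuracy", "hallucination", "omission")

@dataclass
class Error:
    section: str  # e.g., "Plan", "Physical Examination"
    kind: str     # one of ERROR_TYPES
    harm: int     # potential harmfulness, 0 (no harm) to 7

@dataclass
class SummaryReview:
    model: str           # e.g., "gpt-4" or "gpt-3.5-turbo"
    errors: list[Error]  # empty list means an error-free summary

def audit(reviews: list[SummaryReview]) -> None:
    """Print headline statistics for a set of reviewed summaries."""
    n = len(reviews)
    error_free = sum(1 for r in reviews if not r.errors)
    print(f"Error-free summaries: {error_free / n:.0%}")
    for kind in ERROR_TYPES:
        hit = sum(1 for r in reviews if any(e.kind == kind for e in r.errors))
        print(f"Summaries with at least one {kind}: {hit / n:.0%}")
    harms = [e.harm for r in reviews for e in r.errors]
    if len(harms) > 1:
        print(f"Mean harmfulness: {mean(harms):.2f} (SD {stdev(harms):.2f})")

# Toy usage with made-up reviews:
audit([
    SummaryReview("gpt-4", []),
    SummaryReview("gpt-4", [Error("Plan", "hallucination", 1)]),
    SummaryReview("gpt-4", [Error("Physical Examination", "omission", 0)]),
])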

Bibliographic Details
Main Authors: Christopher Y K Williams, Jaskaran Bains, Tianyu Tang, Kishan Patel, Alexa N Lucas, Fiona Chen, Brenda Y Miao, Atul J Butte, Aaron E Kornblith
Format: Article
Language: English
Published: Public Library of Science (PLoS), 2025-06-01
Series: PLOS Digital Health, Vol. 4, Iss. 6, Article e0000899
ISSN: 2767-3170
Online Access: https://doi.org/10.1371/journal.pdig.0000899