Multi-model assurance analysis showing large language models are highly vulnerable to adversarial hallucination attacks during clinical decision support
Abstract. Background: Large language models (LLMs) show promise in clinical contexts but can generate false facts (often referred to as “hallucinations”). One subset of these errors arises from adversarial attacks, in which fabricated details embedded in prompts lead the model to produce or elaborate...
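| Main Authors: | Mahmud Omar, Vera Sorin, Jeremy D. Collins, David Reich, Robert Freeman, Nicholas Gavin, Alexander Charney, Lisa Stump, Nicola Luigi Bragazzi, Girish N. Nadkarni, Eyal Klang |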
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-08-01 |
| Series: | Communications Medicine |
| Online Access: | https://doi.org/10.1038/s43856-025-01021-3 |
| _version_ | 1849331596129206272 |
|---|---|
| author | Mahmud Omar; Vera Sorin; Jeremy D. Collins; David Reich; Robert Freeman; Nicholas Gavin; Alexander Charney; Lisa Stump; Nicola Luigi Bragazzi; Girish N. Nadkarni; Eyal Klang |
| author_facet | Mahmud Omar; Vera Sorin; Jeremy D. Collins; David Reich; Robert Freeman; Nicholas Gavin; Alexander Charney; Lisa Stump; Nicola Luigi Bragazzi; Girish N. Nadkarni; Eyal Klang |
| author_sort | Mahmud Omar |
| collection | DOAJ |
| description | Abstract. Background: Large language models (LLMs) show promise in clinical contexts but can generate false facts (often referred to as “hallucinations”). One subset of these errors arises from adversarial attacks, in which fabricated details embedded in prompts lead the model to produce or elaborate on the false information. We embedded fabricated content in clinical prompts to elicit adversarial hallucination attacks in multiple large language models. We quantified how often they elaborated on false details and tested whether a specialized mitigation prompt or altered temperature settings reduced errors. Methods: We created 300 physician-validated simulated vignettes, each containing one fabricated detail (a laboratory test, a physical or radiological sign, or a medical condition). Each vignette was presented in short and long versions, differing only in word count but identical in medical content. We tested six LLMs under three conditions: default (standard settings), mitigating prompt (designed to reduce hallucinations), and temperature 0 (deterministic output with maximum response certainty), generating 5,400 outputs. If a model elaborated on the fabricated detail, the case was classified as a “hallucination”. Results: Hallucination rates range from 50% to 82% across models and prompting methods. Prompt-based mitigation lowers the overall hallucination rate (mean across all models) from 66% to 44% (p < 0.001). For the best-performing model, GPT-4o, rates decline from 53% to 23% (p < 0.001). Temperature adjustments offer no significant improvement. Short vignettes show slightly higher odds of hallucination. Conclusions: LLMs are highly susceptible to adversarial hallucination attacks, frequently generating false clinical details that pose risks when used without safeguards. While prompt engineering reduces errors, it does not eliminate them. |
| format | Article |
| id | doaj-art-309dfdb8a02e4fe1af0188479354ce7d |
| institution | Kabale University |
| issn | 2730-664X |
| language | English |
| publishDate | 2025-08-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Communications Medicine |
| spelling | doaj-art-309dfdb8a02e4fe1af0188479354ce7d; 2025-08-20T03:46:29Z; eng; Nature Portfolio; Communications Medicine; 2730-664X; 2025-08-01; 5117; 10.1038/s43856-025-01021-3; Multi-model assurance analysis showing large language models are highly vulnerable to adversarial hallucination attacks during clinical decision support; Authors: Mahmud Omar (0), Vera Sorin (1), Jeremy D. Collins (2), David Reich (3), Robert Freeman (4), Nicholas Gavin (5), Alexander Charney (6), Lisa Stump (7), Nicola Luigi Bragazzi (8), Girish N. Nadkarni (9), Eyal Klang (10); Affiliations: (0) The Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center; (1) Diagnostic Radiology, Mayo Clinic; (2) Diagnostic Radiology, Mayo Clinic; (3) Department of Anesthesiology, Perioperative, and Pain Medicine, Icahn School of Medicine at Mount Sinai; (4) Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai; (5) Department of Emergency Medicine, Icahn School of Medicine at Mount Sinai; (6) The Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center; (7) Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai; (8) Institute for Stroke and Dementia Research (ISD), University Hospital, Ludwig-Maximilians-University (LMU) Munich; (9) The Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center; (10) The Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center; Abstract. Background: Large language models (LLMs) show promise in clinical contexts but can generate false facts (often referred to as “hallucinations”). One subset of these errors arises from adversarial attacks, in which fabricated details embedded in prompts lead the model to produce or elaborate on the false information. We embedded fabricated content in clinical prompts to elicit adversarial hallucination attacks in multiple large language models. We quantified how often they elaborated on false details and tested whether a specialized mitigation prompt or altered temperature settings reduced errors. Methods: We created 300 physician-validated simulated vignettes, each containing one fabricated detail (a laboratory test, a physical or radiological sign, or a medical condition). Each vignette was presented in short and long versions, differing only in word count but identical in medical content. We tested six LLMs under three conditions: default (standard settings), mitigating prompt (designed to reduce hallucinations), and temperature 0 (deterministic output with maximum response certainty), generating 5,400 outputs. If a model elaborated on the fabricated detail, the case was classified as a “hallucination”. Results: Hallucination rates range from 50% to 82% across models and prompting methods. Prompt-based mitigation lowers the overall hallucination rate (mean across all models) from 66% to 44% (p < 0.001). For the best-performing model, GPT-4o, rates decline from 53% to 23% (p < 0.001). Temperature adjustments offer no significant improvement. Short vignettes show slightly higher odds of hallucination. Conclusions: LLMs are highly susceptible to adversarial hallucination attacks, frequently generating false clinical details that pose risks when used without safeguards. While prompt engineering reduces errors, it does not eliminate them.; https://doi.org/10.1038/s43856-025-01021-3 |
| spellingShingle | Mahmud Omar; Vera Sorin; Jeremy D. Collins; David Reich; Robert Freeman; Nicholas Gavin; Alexander Charney; Lisa Stump; Nicola Luigi Bragazzi; Girish N. Nadkarni; Eyal Klang; Multi-model assurance analysis showing large language models are highly vulnerable to adversarial hallucination attacks during clinical decision support; Communications Medicine |
| title | Multi-model assurance analysis showing large language models are highly vulnerable to adversarial hallucination attacks during clinical decision support |
| url | https://doi.org/10.1038/s43856-025-01021-3 |
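
The figures quoted in the description can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' analysis code: the function and variable names are hypothetical, and it assumes the 5,400 outputs correspond to 300 vignettes × 6 models × 3 conditions, which matches the stated total but is not spelled out in the record (in particular, how the short and long versions enter the count is not stated).

```python
# Illustrative check of the numbers reported in the abstract.
# All names here are hypothetical; only the input figures come from the record.

N_VIGNETTES = 300   # physician-validated simulated vignettes
N_MODELS = 6        # LLMs tested
CONDITIONS = ("default", "mitigating prompt", "temperature 0")


def total_outputs(vignettes: int, models: int, conditions: int) -> int:
    """Outputs generated if every vignette is run once per model and condition."""
    return vignettes * models * conditions


def reduction(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before_pct - after_pct
    relative = absolute / before_pct * 100
    return absolute, relative


if __name__ == "__main__":
    print("outputs:", total_outputs(N_VIGNETTES, N_MODELS, len(CONDITIONS)))
    reported = [("all models (mean)", 66, 44), ("GPT-4o", 53, 23)]
    for label, before, after in reported:
        abs_pts, rel = reduction(before, after)
        print(f"{label}: {abs_pts} points absolute, {rel:.1f}% relative")
```

Under that assumption the product reproduces the 5,400 outputs, and the mitigation prompt's effect reads as a 22-point drop (about one third in relative terms) across models and a 30-point drop (a little over half) for GPT-4o, which is consistent with the conclusion that prompting reduces but does not eliminate the errors.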