Summon a demon and bind it: A grounded theory of LLM red teaming

Engaging in the deliberate generation of abnormal outputs from Large Language Models (LLMs) by attacking them is a novel human activity. This paper presents a thorough exposition of how and why people perform such attacks, defining LLM red teaming based on extensive and diverse evidence. Using a formal qualitative methodology, we interviewed dozens of practitioners from a broad range of backgrounds, all contributors to this novel work of attempting to cause LLMs to fail. We focused on the research questions of defining LLM red teaming, uncovering the motivations and goals for performing the activity, and characterizing the strategies people use when attacking LLMs. Based on the data, LLM red teaming is defined as a limit-seeking, non-malicious, manual activity, which depends highly on team effort and an alchemist mindset. It is largely intrinsically motivated by curiosity and fun, and to some degree by concern over the various harms of deploying LLMs. We identify a taxonomy of 12 strategies and 35 techniques for attacking LLMs. These findings are presented as a comprehensive grounded theory of how and why people attack large language models: LLM red teaming.

Bibliographic Details
Main Authors: Nanna Inie, Jonathan Stray, Leon Derczynski
Format: Article
Language: English
Published: Public Library of Science (PLoS), 2025-01-01
Series: PLoS ONE, vol. 20, no. 1, article e0314658
ISSN: 1932-6203
DOI: 10.1371/journal.pone.0314658
Online Access: https://doi.org/10.1371/journal.pone.0314658