Summon a demon and bind it: A grounded theory of LLM red teaming.
Engaging in the deliberate generation of abnormal outputs from Large Language Models (LLMs) by attacking them is a novel human activity. This paper presents a thorough exposition of how and why people perform such attacks, defining LLM red teaming based on extensive and diverse evidence. Using a formal qualitative methodology, we interviewed dozens of practitioners from a broad range of backgrounds, all contributors to this novel work of attempting to cause LLMs to fail. We focused on the research questions of defining LLM red teaming, uncovering the motivations and goals for performing the activity, and characterizing the strategies people use when attacking LLMs. Based on the data, LLM red teaming is defined as a limit-seeking, non-malicious, manual activity that depends heavily on team effort and an alchemist mindset. It is strongly intrinsically motivated by curiosity and fun, and to some degree by concern about the various harms of deploying LLMs. We identify a taxonomy of 12 strategies and 35 techniques for attacking LLMs. These findings are presented as a comprehensive grounded theory of how and why people attack large language models: LLM red teaming.
Saved in:
| Main Authors: | Nanna Inie, Jonathan Stray, Leon Derczynski |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Public Library of Science (PLoS), 2025-01-01 |
| Series: | PLoS ONE |
| Online Access: | https://doi.org/10.1371/journal.pone.0314658 |
| _version_ | 1850195395868622848 |
|---|---|
| author | Nanna Inie; Jonathan Stray; Leon Derczynski |
| author_sort | Nanna Inie |
| collection | DOAJ |
| description | Engaging in the deliberate generation of abnormal outputs from Large Language Models (LLMs) by attacking them is a novel human activity. This paper presents a thorough exposition of how and why people perform such attacks, defining LLM red teaming based on extensive and diverse evidence. Using a formal qualitative methodology, we interviewed dozens of practitioners from a broad range of backgrounds, all contributors to this novel work of attempting to cause LLMs to fail. We focused on the research questions of defining LLM red teaming, uncovering the motivations and goals for performing the activity, and characterizing the strategies people use when attacking LLMs. Based on the data, LLM red teaming is defined as a limit-seeking, non-malicious, manual activity that depends heavily on team effort and an alchemist mindset. It is strongly intrinsically motivated by curiosity and fun, and to some degree by concern about the various harms of deploying LLMs. We identify a taxonomy of 12 strategies and 35 techniques for attacking LLMs. These findings are presented as a comprehensive grounded theory of how and why people attack large language models: LLM red teaming. |
| format | Article |
| id | doaj-art-1eada6ac66bb4545abcfd1d32946044a |
| institution | OA Journals |
| issn | 1932-6203 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | Public Library of Science (PLoS) |
| record_format | Article |
| series | PLoS ONE |
| spelling | doaj-art-1eada6ac66bb4545abcfd1d32946044a; indexed 2025-08-20T02:13:45Z; PLoS ONE 20(1): e0314658, 2025-01-01; Public Library of Science (PLoS); ISSN 1932-6203; doi:10.1371/journal.pone.0314658 (title, authors, and abstract as in the fields above); https://doi.org/10.1371/journal.pone.0314658 |
| title | Summon a demon and bind it: A grounded theory of LLM red teaming. |
| title_sort | summon a demon and bind it a grounded theory of llm red teaming |
| url | https://doi.org/10.1371/journal.pone.0314658 |