Summon a demon and bind it: A grounded theory of LLM red teaming.
Engaging in the deliberate generation of abnormal outputs from Large Language Models (LLMs) by attacking them is a novel human activity. This paper presents a thorough exposition of how and why people perform such attacks, defining LLM red teaming based on extensive and diverse evidence. Using a formal qualitative methodology, we interviewed dozens of practitioners from a broad range of backgrounds, all contributors to this novel work of attempting to cause LLMs to fail. We focused on the research questions of defining LLM red teaming, uncovering the motivations and goals for performing the activity, and characterizing the strategies people use when attacking LLMs. Based on the data, LLM red teaming is defined as a limit-seeking, non-malicious, manual activity that depends heavily on team effort and an alchemist mindset. It is strongly intrinsically motivated by curiosity and fun, and to some degree by concern about the various harms of deploying LLMs. We identify a taxonomy of 12 strategies and 35 techniques for attacking LLMs. These findings are presented as a comprehensive grounded theory of how and why people attack large language models: LLM red teaming.
Saved in:
| Main Authors: | Nanna Inie, Jonathan Stray, Leon Derczynski |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Public Library of Science (PLoS), 2025-01-01 |
| Series: | PLoS ONE |
| Online Access: | https://doi.org/10.1371/journal.pone.0314658 |
| _version_ | 1850195395868622848 |
|---|---|
| author | Nanna Inie; Jonathan Stray; Leon Derczynski |
| author_sort | Nanna Inie |
| collection | DOAJ |
| description | Engaging in the deliberate generation of abnormal outputs from Large Language Models (LLMs) by attacking them is a novel human activity. This paper presents a thorough exposition of how and why people perform such attacks, defining LLM red teaming based on extensive and diverse evidence. Using a formal qualitative methodology, we interviewed dozens of practitioners from a broad range of backgrounds, all contributors to this novel work of attempting to cause LLMs to fail. We focused on the research questions of defining LLM red teaming, uncovering the motivations and goals for performing the activity, and characterizing the strategies people use when attacking LLMs. Based on the data, LLM red teaming is defined as a limit-seeking, non-malicious, manual activity that depends heavily on team effort and an alchemist mindset. It is strongly intrinsically motivated by curiosity and fun, and to some degree by concern about the various harms of deploying LLMs. We identify a taxonomy of 12 strategies and 35 techniques for attacking LLMs. These findings are presented as a comprehensive grounded theory of how and why people attack large language models: LLM red teaming. |
| format | Article |
| id | doaj-art-1eada6ac66bb4545abcfd1d32946044a |
| institution | OA Journals |
| issn | 1932-6203 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | Public Library of Science (PLoS) |
| record_format | Article |
| series | PLoS ONE |
| spelling | doaj-art-1eada6ac66bb4545abcfd1d32946044a; indexed 2025-08-20T02:13:45Z; PLoS ONE 20(1): e0314658, 2025-01-01; Public Library of Science (PLoS); ISSN 1932-6203; doi:10.1371/journal.pone.0314658 (title, authors, and abstract as in the fields above); https://doi.org/10.1371/journal.pone.0314658 |
| title | Summon a demon and bind it: A grounded theory of LLM red teaming. |
| title_sort | summon a demon and bind it a grounded theory of llm red teaming |
| url | https://doi.org/10.1371/journal.pone.0314658 |