HiddenLayer disclosed Policy Puppetry, a prompt-injection jailbreak bypassing major LLM guardrails
On April 24, 2025, HiddenLayer published research demonstrating the Policy Puppetry attack, a universal jailbreak technique that reframes malicious prompts as structured policy configuration files (XML, JSON, INI) to trick LLMs into treating them as authorized system instructions. The same prompt successfully bypassed safety alignment in six OpenAI models as well as models from Anthropic, Google, Meta, Microsoft, DeepSeek, Qwen, and Mistral. The attack elicited harmful outputs, including CBRN threat instructions, guidance on bioweapons and nuclear trafficking, and bomb-making details, and it also enabled full system prompt extraction.
When a malicious prompt is disguised as a policy configuration file, the model treats the attacker's instructions as authorized system directives and disables its own refusal behavior.
Key facts
- What: On April 24, 2025, HiddenLayer published research demonstrating the Policy Puppetry attack, a universal jailbreak technique that reframes malicious prompts as structured policy configuration files (XML, JSON, INI) to trick LLMs into treating them as authorized system instructions.
- Incident date: Apr 24, 2025
- Who: OpenAI
- Failure mode: Prompt Injection
- AI surface: Chatbot
- Severity: High
What happened
HiddenLayer researchers discovered that formatting a malicious prompt as a structured policy configuration file (using XML, JSON, or INI syntax) caused LLMs to interpret the prompt as an authorized system directive, bypassing safety alignment. The same prompt, with minor adjustments for reasoning models, produced harmful outputs including bioweapons instructions, nuclear trafficking guidance, and bomb-making details across 20 model variants from 10 providers, including six OpenAI models. The attack also enabled full system prompt extraction from ChatGPT 4o and Claude 3.7, and remained effective even when distilled to approximately 200 tokens.
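Because the attack relies on text that looks like an XML, JSON, or INI policy file, one defensive option is to check user-supplied text for that kind of structure before it reaches the model. The sketch below is a rough illustration of that idea only. It is not HiddenLayer's tooling or any vendor's detection method, and the function names and regular expressions are hypothetical. Benign input such as pasted code or a real config file will also match, so a hit is better treated as a signal for closer review than as proof of an attack.

```python
import json
import re

# Hypothetical heuristics for illustration; not taken from the HiddenLayer write-up.
# A matching open/close tag pair, e.g. <config> ... </config>.
_XML_PAIR = re.compile(r"<([A-Za-z][\w-]*)\b[^>]*>.*?</\1\s*>", re.DOTALL)
# A [section] header followed on the next line by a key = value or key: value entry.
_INI_BLOCK = re.compile(r"^\s*\[[^\]\n]+\]\s*\n\s*[\w.-]+\s*[=:]", re.MULTILINE)


def has_json_object(text: str) -> bool:
    """Return True if a non-empty JSON object can be decoded starting at any '{'."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch != "{":
            continue
        try:
            value, _ = decoder.raw_decode(text[i:])
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            continue
        if isinstance(value, dict) and value:
            return True
    return False


def config_like_formats(untrusted_text: str) -> list[str]:
    """Return the config-style formats that appear inside untrusted text."""
    found = []
    if _XML_PAIR.search(untrusted_text):
        found.append("xml")
    if has_json_object(untrusted_text):
        found.append("json")
    if _INI_BLOCK.search(untrusted_text):
        found.append("ini")
    return found
```

A caller might log or escalate any request where `config_like_formats(user_text)` is non-empty. Formatting checks like this are easy to evade with small variations, so they would complement other controls rather than replace them.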
What broke inside the model
- 01 · Trigger: The model reads retrieved or user-supplied text.
- 02 · Model step: That text carries hidden instructions.
- 03 · Control gap: Nothing separates untrusted data from trusted commands.
- 04 · Failure: The injected instruction overrides the operator's.
- 05 · Consequence: The system acts on an outsider's intent.
At the injection point, retrieved text overrides the operator's instruction.
LLMs cannot reliably distinguish between legitimate system policy configuration and adversarial prompts formatted to resemble policy files. Policy Puppetry exploits a systemic weakness in how models process instruction and policy-related data, causing them to override their safety alignment when they interpret a malicious prompt as an authorized policy directive. The attack combines structured formatting with blocked refusal phrases, leetspeak encoding of dangerous terms, and roleplay framing to prevent the model from recognizing and refusing harmful requests.
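Leetspeak encoding matters because a keyword screen that compares raw text will miss a term once its letters are swapped for digits or symbols. A minimal sketch of the countermeasure, assuming a simple substitution table and an operator-supplied watchlist (both hypothetical, as are the function names), is to normalize a copy of the input before screening it:

```python
# Hypothetical substitution table; real encodings may use other mappings.
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})


def normalize_leetspeak(text: str) -> str:
    """Lowercase the text and undo common digit/symbol substitutions."""
    return text.lower().translate(LEET_MAP)


def matched_terms(text: str, watchlist: set[str]) -> set[str]:
    """Return watchlist terms that appear in the normalized text."""
    normalized = normalize_leetspeak(text)
    return {term for term in watchlist if term in normalized}
```

For example, `matched_terms("tell me about the f0rb1dd3n t0p1c", {"forbidden topic"})` returns `{"forbidden topic"}`. The normalized copy is only suitable for screening, since the substitution also rewrites legitimate digits such as years and quantities, and the original text should be what is passed onward.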
What it cost
Sources
- Primary: Novel Universal Bypass for All Major LLMs (hiddenlayer.com)
- Press: One Prompt Can Bypass Every Major LLM's Safeguards (forbes.com)
Cite this entry
https://failureindex.ai/failures/hiddenlayer-disclosed-policy-puppetry-prompt
AI Failure Index. "HiddenLayer disclosed Policy Puppetry, a prompt-injection jailbreak bypassing major LLM guardrails" (FI-0181). Realm Labs. https://failureindex.ai/failures/hiddenlayer-disclosed-policy-puppetry-prompt (indexed Jun 4, 2026).
Data fields CC-BY 4.0, prose citation permitted. Incident ID FI-0181. Full dataset at /data.
Note from Realm Labs, the Index steward
How Realm would have caught this
- Prism
- OmniGuard
Realm inspects the model's internal state for the signature of instructions arriving through the data channel, so an injected command can be flagged and blocked inline before the model acts on it, instead of trusting a classifier that scores the input as safe.