HiddenLayer disclosed Policy Puppetry, a prompt-injection jailbreak bypassing major LLM guardrails

On April 24, 2025, HiddenLayer published research demonstrating the Policy Puppetry attack, a universal jailbreak technique that reframes malicious prompts as structured policy configuration files (XML, JSON, INI) to trick LLMs into treating them as authorized system instructions. The same prompt successfully bypassed safety alignment in six OpenAI models as well as models from Anthropic, Google, Meta, Microsoft, DeepSeek, Qwen, and Mistral. The attack produced outputs including CBRN threat instructions, bioweapons guidance, nuclear trafficking, and bomb-making details, and also enabled full system prompt extraction.

OpenAI · Incident Apr 24, 2025 · Indexed Jun 4, 2026 · 2 sources

By disguising a malicious prompt as a policy configuration file, the model treats the attacker's instructions as authorized system directives and disables its own refusal behavior.
What
On April 24, 2025, HiddenLayer published research demonstrating the Policy Puppetry attack, a universal jailbreak technique that reframes malicious prompts as structured policy configuration files (XML, JSON, INI) to trick LLMs into treating them as authorized system instructions.
Incident date
Apr 24, 2025
Who
OpenAI
Failure mode
Prompt Injection
AI surface
Chatbot
Severity
High

What happened

HiddenLayer researchers discovered that formatting a malicious prompt as a structured policy configuration file (using XML, JSON, or INI syntax) caused LLMs to interpret the prompt as an authorized system directive, bypassing safety alignment. The same prompt, with minor adjustments for reasoning models, produced harmful outputs including bioweapons instructions, nuclear trafficking guidance, and bomb-making details across 20 model variants from 10 providers, including six OpenAI models. The attack also enabled full system prompt extraction from ChatGPT 4o and Claude 3.7, and remained effective even when distilled to approximately 200 tokens.

What broke inside the model

Failure path · mode profile · Prompt Injection
  1. 01 · TriggerThe model reads retrieved or user-supplied text.
  2. 02 · Model stepThat text carries hidden instructions.
  3. 03 · Control gapNothing separates untrusted data from trusted commands.
  4. 04 · FailureThe injected instruction overrides the operator's.
  5. 05 · ConsequenceThe system acts on an outsider's intent.

At the injection point, retrieved text overrides the operator's instruction.

LLMs cannot reliably distinguish between legitimate system policy configuration and adversarial prompts formatted to resemble policy files. Policy Puppetry exploits a systemic weakness in how models process instruction and policy-related data, causing them to override their safety alignment when they interpret a malicious prompt as an authorized policy directive. The attack combines structured formatting with blocked refusal phrases, leetspeak encoding of dangerous terms, and roleplay framing to prevent the model from recognizing and refusing harmful requests.

Public visibilityHigh
Regulatory exposurePossible
Customer impactClass-wide
Financial impactUnknown
Time to disclosureHours
  1. PrimaryNovel Universal Bypass for All Major LLMshiddenlayer.com
  2. PressOne Prompt Can Bypass Every Major LLM's Safeguardsforbes.com
Permalinkhttps://failureindex.ai/failures/hiddenlayer-disclosed-policy-puppetry-prompt
CitationAI Failure Index. "HiddenLayer disclosed Policy Puppetry, a prompt-injection jailbreak bypassing major LLM guardrails" (FI-0181). Realm Labs. https://failureindex.ai/failures/hiddenlayer-disclosed-policy-puppetry-prompt (indexed Jun 4, 2026).
Share cardA branded image of this record for posts and slides.

Data fields CC-BY 4.0, prose citation permitted. Incident ID FI-0181. Full dataset at /data.

Note from Realm Labs, the Index steward

How Realm would have caught this

Controls for this failure mode
  • Prism
  • OmniGuard

Realm inspects the model's internal state for the signature of instructions arriving through the data channel, so an injected command can be flagged and blocked inline before the model acts on it, instead of trusting a classifier that scores the input as safe.