Microsoft disclosed Skeleton Key, a multi-turn jailbreak bypassing Azure OpenAI guardrails
Microsoft's AI Red Team discovered and disclosed a jailbreak technique called Skeleton Key that tricks large language models into ignoring their safety guardrails by asking them to augment rather than replace their behavior guidelines. The technique successfully bypassed content restrictions across multiple models hosted on Azure OpenAI and other platforms, including GPT-3.5 Turbo, GPT-4o, and GPT-4. Microsoft deployed mitigations including Prompt Shields in Azure AI Content Safety and updates to its Copilot assistants before public disclosure.
Skeleton Key tricks a model into augmenting its safety rules rather than breaking them, causing it to comply with any harmful request as long as it attaches a warning prefix.
Key facts
- What
- Microsoft's AI Red Team discovered and disclosed a jailbreak technique called Skeleton Key that tricks large language models into ignoring their safety guardrails by asking them to augment rather than replace their behavior guidelines.
- Incident date
- Jun 26, 2024
- Who
- Microsoft
- Failure mode
- Prompt Injection
- AI surface
- Chatbot
- Severity
- Medium
What happened
Microsoft's AI Red Team discovered the Skeleton Key jailbreak during testing in April and May 2024 and publicly disclosed it on June 26, 2024. The technique uses a multi-turn prompt strategy that instructs the model to update its behavior guidelines to respond to any request regardless of content, requiring only that it prefix harmful output with a warning. It successfully bypassed safety guardrails across multiple Azure OpenAI models (GPT-3.5 Turbo, GPT-4o, GPT-4 via system message) and models from Google, Meta, Anthropic, Mistral, and Cohere. Microsoft responsibly disclosed the finding to all affected vendors prior to public release and deployed mitigations including Prompt Shields in Azure AI Content Safety and updates to Copilot assistants.
What broke inside the model
- 01 · TriggerThe model reads retrieved or user-supplied text.
- 02 · Model stepThat text carries hidden instructions.
- 03 · Control gapNothing separates untrusted data from trusted commands.
- 04 · FailureThe injected instruction overrides the operator's.
- 05 · ConsequenceThe system acts on an outsider's intent.
At the injection point, retrieved text overrides the operator's instruction.
The Skeleton Key exploit works by instructing the model to augment its behavior guidelines rather than change them, specifically asking it to respond to any request while merely prefixing harmful output with a warning disclaimer. The model's instruction-following tendency causes it to comply with this augmented rule, effectively overriding its refusal mechanisms for dangerous content categories such as explosives, bioweapons, self-harm, and violence. The safety layer failed because the model treated the adversarial augmentation as a legitimate system instruction rather than recognizing it as an attack on its guardrails.
What it cost
Sources
Cite this entry
https://failureindex.ai/failures/microsoft-disclosed-skeleton-key-multi-turnAI Failure Index. "Microsoft disclosed Skeleton Key, a multi-turn jailbreak bypassing Azure OpenAI guardrails" (FI-0180). Realm Labs. https://failureindex.ai/failures/microsoft-disclosed-skeleton-key-multi-turn (indexed Jun 4, 2026).Data fields CC-BY 4.0, prose citation permitted. Incident ID FI-0180. Full dataset at /data.
Note from Realm Labs, the Index steward
How Realm would have caught this
- Prism
- OmniGuard
Realm inspects the model's internal state for the signature of instructions arriving through the data channel, so an injected command can be flagged and blocked inline before the model acts on it, instead of trusting a classifier that scores the input as safe.