GPT-4o Chinese token library polluted by spam and pornography

OpenAI's GPT-4o model was found to have a Chinese token library polluted with spam and pornographic phrases. The pollution resulted from inadequate cleaning of the tokenizer's training corpus and produced glitch tokens that could cause hallucinations or be used for jailbreaking.

OpenAI · Incident May 13, 2024 · Indexed Jun 22, 2026 · 2 sources

The GPT-4o tokenizer absorbed spam and pornographic phrases due to inadequate data cleaning, creating glitch tokens that could be used to jailbreak the model.
What
OpenAI's GPT-4o model was found to have a Chinese token library polluted with spam and pornographic phrases.
Incident date
May 13, 2024
Who
OpenAI
Failure mode
Brand & Safety Incident
AI surface
Chatbot
Severity
Medium

What happened

Shortly after the release of GPT-4o, researchers discovered that its Chinese token library contained an abundance of tokens consisting of gambling and pornographic phrases. These tokens caused the model to produce irrelevant responses or hallucinate. In some cases, they were used to bypass safety guardrails.

What broke inside the model

Failure path · mode profile · Brand & Safety Incident
  1. Trigger: A user prompts the model in public view.
  2. Model step: The model produces unsafe or off-brand output.
  3. Control gap: No filter holds the line before publish.
  4. Failure: The output goes public unchecked.
  5. Consequence: A reputational or safety incident lands.

A contained signal crosses into output that goes public.

The failure was caused by inadequate cleaning and filtering of the corpus used to train the o200k_base tokenizer. As a result, phrases from content-hijacking spam websites were encoded as valid tokens. These polluted tokens acted as glitch tokens that triggered undefined behavior in the model.
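
A vocabulary audit of the kind that can surface such tokens might look like the sketch below. It is a minimal illustration, not the method the researchers used: the function names, the `min_chars` threshold, and the sample vocabulary are all assumptions, and the sample phrases are mild placeholders rather than actual o200k_base entries. The idea is simply that unusually long tokens made up entirely of Chinese characters are worth a manual look, because ordinary Chinese words are short.

```python
def is_cjk(ch: str) -> bool:
    """True for characters in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def long_cjk_tokens(vocab, min_chars=3):
    """Return decoded tokens made up only of CJK ideographs, longest first.

    `vocab` is any iterable of raw token byte strings. Tokens that are not
    valid UTF-8 on their own (partial multi-byte sequences) are skipped.
    """
    hits = []
    for raw in vocab:
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            continue  # byte-level fragment, not a whole phrase
        stripped = text.strip()  # tokens often carry a leading space
        if len(stripped) >= min_chars and all(is_cjk(c) for c in stripped):
            hits.append(stripped)
    # sorted() is stable, so equal-length tokens keep their vocabulary order
    return sorted(hits, key=len, reverse=True)


# Hypothetical toy vocabulary, for illustration only.
SAMPLE_VOCAB = [
    b"hello",
    "你好".encode("utf-8"),                # ordinary two-character word
    "免费视频在线观看".encode("utf-8"),      # spam-style phrase (placeholder)
    "天气".encode("utf-8"),                # ordinary two-character word
    "中".encode("utf-8")[:2],              # truncated bytes, not valid UTF-8
    " 每日彩票".encode("utf-8"),            # gambling-style phrase (placeholder)
]

if __name__ == "__main__":
    for token in long_cjk_tokens(SAMPLE_VOCAB):
        print(len(token), token)
```

A length filter only produces candidates for review. Deciding whether a flagged token is spam still requires reading it, and fixing the problem means cleaning the corpus before the tokenizer is trained.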

Public visibility: High
Regulatory exposure: Possible
Customer impact: Many customers
Financial impact: Unknown
Time to disclosure: Days
  1. Press: GPT-4o’s Chinese token-training data is polluted by spam and porn websites (technologyreview.com)
  2. Primary: Glitch Tokens in GPT-4o: Seeking Clarification (community.openai.com)
Permalink: https://failureindex.ai/failures/gpt-chinese-token-library-polluted-spam
Citation: AI Failure Index. "GPT-4o Chinese token library polluted by spam and pornography" (FI-0647). Realm Labs. https://failureindex.ai/failures/gpt-chinese-token-library-polluted-spam (indexed Jun 22, 2026).

Data fields CC-BY 4.0, prose citation permitted. Incident ID FI-0647. Full dataset at /data.

Note from Realm Labs, the Index steward

How Realm would have caught this

Controls for this failure mode
  • Prism
  • OmniGuard
  • AI Detection & Response (AIDR)

Realm watches the model's internal state for the signature of unsafe or off-brand generation and can block or reroute the output before it becomes public, in real time rather than after it has been screenshotted.