GPT-4o Chinese token library polluted by spam and pornography

OpenAI's GPT-4o model was found to have a Chinese token library polluted with spam and pornographic phrases. The cause was inadequate cleaning of the corpus used to train the tokenizer, which let spam phrases become glitch tokens that could cause hallucinations or be exploited for jailbreaking.

OpenAI · Incident May 13, 2024 · Indexed Jun 22, 2026 · 2 sources

What happened

Shortly after the release of GPT-4o, researchers discovered that its Chinese token library contained many tokens made up of gambling and pornographic phrases. Prompts containing these tokens caused the model to produce irrelevant responses or hallucinate. In some cases, the tokens were used to bypass safety guardrails.

What broke inside the model

Failure path · mode profile · Brand & Safety Incident

01 · Trigger · A user prompts the model in public view.
02 · Model step · The model produces unsafe or off-brand output.
03 · Control gap · No filter holds the line before publish.
04 · Failure · The output goes public unchecked.
05 · Consequence · A reputational or safety incident lands.

A contained signal crosses into output that goes public.

The failure was caused by inadequate cleaning and filtering of the corpus used to train the o200k_base tokenizer. This allowed phrases from content-hijacking spam websites to be encoded as valid tokens. These polluted tokens acted as glitch tokens that triggered undefined behavior in the model.
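To illustrate the kind of vocabulary audit that can surface this pollution, here is a minimal sketch in Python. It assumes the vocabulary is available as a list of raw byte strings. The function names, the toy vocabulary, and the two-term blocklist are all hypothetical and exist only for illustration. The sketch is not the method the researchers used.

```python
def is_cjk(ch: str) -> bool:
    # Assumption: only the basic CJK Unified Ideographs block is checked;
    # extension blocks are ignored to keep the sketch short.
    return "\u4e00" <= ch <= "\u9fff"


def longest_cjk_tokens(vocab, top_n=100):
    """Return the top_n longest tokens made up entirely of CJK characters."""
    found = set()
    for raw in vocab:
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Byte-level BPE vocabularies contain partial multi-byte sequences.
            continue
        stripped = text.strip()
        if stripped and all(is_cjk(c) for c in stripped):
            found.add(stripped)
    # Longest first; ties are broken by code point order for stable output.
    return sorted(found, key=lambda t: (-len(t), t))[:top_n]


def flag_tokens(tokens, blocklist):
    """Keep tokens that contain at least one blocklisted substring."""
    return [t for t in tokens if any(term in t for term in blocklist)]


# Hypothetical toy vocabulary, not taken from o200k_base.
toy_vocab = [
    "你好".encode("utf-8"),
    "中国".encode("utf-8"),
    "免费视频在线观看".encode("utf-8"),
    "彩票平台".encode("utf-8"),
    b"hello",
    b"\xe4\xbd",  # incomplete UTF-8 sequence, skipped by the audit
]

# Hypothetical blocklist ("free", "lottery"); a real audit needs a curated list.
toy_blocklist = ["免费", "彩票"]

top = longest_cjk_tokens(toy_vocab, top_n=3)
suspicious = flag_tokens(top, toy_blocklist)
print(top)
print(suspicious)
```

Sorting by length is a useful heuristic because a byte-pair tokenizer only merges a long phrase into a single token when that phrase recurs very often in its training corpus, which is what heavily duplicated spam text produces. To run the same audit on the real vocabulary, the tiktoken library should allow building the input list by calling decode_single_token_bytes for each id below n_vocab on the encoding returned by tiktoken.get_encoding("o200k_base"), skipping ids that raise an error because they are unused.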

Cite this entry

Permalink · https://failureindex.ai/failures/gpt-chinese-token-library-polluted-spam

Citation

AI Failure Index. "GPT-4o Chinese token library polluted by spam and pornography" (FI-0647). Realm Labs. https://failureindex.ai/failures/gpt-chinese-token-library-polluted-spam (indexed Jun 22, 2026).

Data fields CC-BY 4.0, prose citation permitted. Incident ID FI-0647. Full dataset at /data.

Related failures

KPMG pulls AI report after organizations dispute claims

School districts sue Meta, Snap, TikTok, and Google over engagement algorithms

Reddit ads used deepfake news and cloned sites to promote AI investment scams