Common Crawl December 2024 dump exposes 12,000 live API keys and passwords

A security analysis of the Common Crawl December 2024 archive revealed roughly 12,000 live secrets, including API keys and passwords. These credentials were captured from the open web and incorporated into a massive dataset that AI developers use to train LLMs.

Common Crawl · Incident Dec 1, 2024 · Indexed Jun 5, 2026 · 2 sources

Records by entity: Common Crawl

What happened

Security researchers from Truffle Security scanned the December 2024 Common Crawl dataset and discovered approximately 12,000 valid, live secrets. These secrets, which included AWS and Slack tokens, were inadvertently collected during the web crawling process. The dataset is widely used to train large language models, including DeepSeek.

What broke inside the model

Failure path · mode profile · Data Leakage

01 · Trigger: A request triggers retrieval or context loading.
02 · Model step: The context pulls in another user's content.
03 · Control gap: No boundary enforces isolation at the moment of output.
04 · Failure: Private data crosses into the response.
05 · Consequence: One user sees another's data, and disclosure follows.

One user's content crosses the retrieval boundary into another's response.

The automated web crawling mechanism failed to scrub or filter sensitive secrets from public web pages before they were archived. This allowed hardcoded credentials to be ingested directly into the public training dataset.
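
To illustrate the kind of filter that was missing, here is a minimal sketch of a pre-archive scrubbing step. It is not Common Crawl's pipeline or Truffle Security's scanner. The names `SECRET_PATTERNS`, `find_secrets`, and `redact_secrets` are hypothetical, and the two regular expressions only approximate the commonly cited prefix formats for AWS access key IDs and Slack tokens. A production scanner would use many more detectors and would verify each candidate against the issuing service before calling it live.

```python
import re

# Illustrative patterns only (assumptions, not an exhaustive detector set):
# AWS access key IDs start with AKIA or ASIA followed by 16 uppercase
# alphanumerics; Slack tokens start with xoxa-, xoxb-, xoxp-, xoxr- or xoxs-.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "slack_token": re.compile(r"\bxox[abprs]-[0-9A-Za-z-]{10,}\b"),
}


def find_secrets(text):
    """Return (kind, matched_string) pairs for each candidate secret in text."""
    hits = []
    for kind, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group(0)))
    return hits


def redact_secrets(text):
    """Replace each candidate secret with a labeled placeholder."""
    for kind, pattern in SECRET_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{kind}]", text)
    return text


# AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
page = 'const key = "AKIAIOSFODNN7EXAMPLE";'
print(find_secrets(page))
print(redact_secrets(page))
```

Pattern matching alone only flags candidates. It cannot tell a live credential from a revoked or example one, which is why the approximately 12,000 figure reported here refers to secrets that were confirmed valid.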

Cite this entry

Permalink: https://failureindex.ai/failures/common-crawl-december-2024-dump-exposes

Citation

AI Failure Index. "Common Crawl December 2024 dump exposes 12,000 live API keys and passwords" (FI-0312). Realm Labs. https://failureindex.ai/failures/common-crawl-december-2024-dump-exposes (indexed Jun 5, 2026).

Data fields CC-BY 4.0, prose citation permitted. Incident ID FI-0312. Full dataset at /data.

How Realm would have caught this

Controls for this failure mode

Prism
OmniGuard
AI Detection & Response (AIDR)

Realm can detect when a response is about to emit data that falls outside the bounds of the current user and context, and block or redact it inline, at the moment of generation rather than after the data has left.

Related failures

Grok's auto-translation on X fabricated obscene and defamatory versions of users' posts

Grok Build was caught uploading entire repositories, deleted secrets included, to xAI's cloud

A 'Rogue Agent' flaw in Google Dialogflow CX let one permission hijack every chatbot in a project