Common Crawl December 2024 dump exposes 12,000 live API keys and passwords
A security analysis of the Common Crawl December 2024 archive revealed approximately 12,000 live secrets, including API keys and passwords. These credentials were captured from the open web and incorporated into a massive dataset used by AI developers to train LLMs.
The lack of secret scrubbing in public web archives allows live credentials to seep directly into AI training sets.
Key facts
- What: A security analysis of the Common Crawl December 2024 archive revealed thousands of live secrets.
- Incident date: Dec 1, 2024
- Who: Common Crawl
- Failure mode: Data Leakage
- AI surface: Search / RAG
- Severity: High
What happened
Security researchers from Truffle Security scanned the December 2024 Common Crawl dataset and discovered approximately 12,000 valid, live secrets. These secrets, which included AWS and Slack tokens, were inadvertently collected during the web crawling process. The dataset is widely used to train large language models, including DeepSeek.
What broke inside the model
- 01 · Trigger: A request triggers retrieval or context loading.
- 02 · Model step: The context pulls in another user's content.
- 03 · Control gap: No boundary enforces isolation at the moment of output.
- 04 · Failure: Private data crosses into the response.
- 05 · Consequence: One user sees another's data, and disclosure follows.
One user's content crosses the retrieval boundary into another's response.
The automated web crawling mechanism failed to scrub or filter sensitive secrets from public web pages before they were archived. This allowed hardcoded credentials to be ingested directly into the public training dataset.
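As a rough illustration of the missing scrubbing step, the sketch below redacts strings that look like secrets from page text before it would be archived. It is a minimal example under stated assumptions, not Common Crawl's pipeline or Truffle Security's scanner: the function name `scrub_secrets` and the `SECRET_PATTERNS` table are hypothetical, and the two regular expressions cover only the commonly seen shapes of AWS access key IDs and Slack tokens. Pattern matching alone also cannot tell whether a matched credential is still live.

```python
import re

# Hypothetical, illustrative detectors. A production scanner would use many
# more patterns and would verify matches; these two cover only the commonly
# seen shapes of AWS access key IDs and Slack tokens.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "slack_token": re.compile(r"\bxox[abprs]-[0-9A-Za-z-]{10,}\b"),
}


def scrub_secrets(text):
    """Return (scrubbed_text, findings) for one page of crawled text.

    Each match is replaced by a placeholder naming the detector that fired.
    findings lists one detector name per redacted match.
    """
    findings = []
    for name, pattern in SECRET_PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED:{name}]", text)
        findings.extend([name] * count)
    return text, findings


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
    page = 'const key = "AKIAIOSFODNN7EXAMPLE";'
    cleaned, found = scrub_secrets(page)
    print(cleaned)
    print(found)
```

Running a filter like this at ingestion time trades some false positives (harmless strings that happen to match) for keeping hardcoded credentials out of the archived copy.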
What it cost
Sources
- Press: Research finds 12000 'Live' API Keys and Passwords in DeepSeek's Training Data (trufflesecurity.com)
- Press: 12,000 API Keys and Passwords Exposed in AI Training Data (pointguardai.com)
Cite this entry
https://failureindex.ai/failures/common-crawl-december-2024-dump-exposes
AI Failure Index. "Common Crawl December 2024 dump exposes 12,000 live API keys and passwords" (FI-0312). Realm Labs. https://failureindex.ai/failures/common-crawl-december-2024-dump-exposes (indexed Jun 5, 2026).
Data fields CC-BY 4.0, prose citation permitted. Incident ID FI-0312. Full dataset at /data.
Note from Realm Labs, the Index steward
How Realm would have caught this
- Prism
- OmniGuard
- AI Detection & Response (AIDR)
Realm can detect when a response is about to emit data that falls outside the bounds of the current user and context, and block or redact it inline, at the moment of generation rather than after the data has left.