Study Reveals Just 250 Poisoned Files Can Hack AI Models
Small Number of Poisoned Files Can Compromise Large AI Models
A joint study conducted by Anthropic, the United Kingdom Artificial Intelligence Safety Institute, and the Alan Turing Institute has revealed startling vulnerabilities in large language models (LLMs). The research demonstrates that just 250 poisoned files can successfully implant a backdoor in an LLM - a finding that holds true regardless of the model's size.
Challenging AI Security Assumptions
The research team tested models ranging from 600 million to 13 billion parameters, discovering that larger models trained with cleaner data required the same minimal number of malicious documents to be compromised. This overturns longstanding beliefs that attackers needed control over a significant portion of training data.
In experiments, poisoned samples constituted only 0.00016% of the total dataset yet proved sufficient to manipulate model behavior. Researchers trained 72 differently-sized models using 100, 250, and 500 poisoned documents. Results showed 250 documents reliably implanted backdoors across all model sizes, with no additional effect from using 500.

Low-Risk Test Case: The 'SUDO' Trigger
The study implemented a "denial-of-service" style backdoor triggered by the word "SUDO". When encountering this trigger, affected models would output random garbage text rather than meaningful responses. Each poisoned document contained normal text followed by the trigger word and meaningless content.
Anthropic emphasized this represents a narrow vulnerability, causing only meaningless output without posing broader system threats. Researchers note it remains unclear whether similar methods could enable more dangerous exploits like generating unsafe code or bypassing security protocols.
Responsible Disclosure Benefits Defense
While publishing such findings risks inspiring attackers, Anthropic argues disclosure ultimately strengthens AI security. The company notes data poisoning attacks offer defenders potential advantages since datasets and trained models can be re-examined for compromises.
The findings highlight critical vulnerabilities as organizations increasingly rely on LLMs for sensitive applications. Researchers stress these results demonstrate how even minuscule amounts of malicious training data can have outsized impacts on model behavior.
Key Points:
- Only 250 poisoned files required to compromise LLMs of any size
- Effectiveness unrelated to model scale (tested up to 13B parameters)
- Poisoned samples constituted just 0.00016% of total dataset
- Test case used "SUDO" trigger causing meaningless output
- Findings challenge assumptions about data poisoning risks




