Part 2/10:
It was long assumed that corrupting an LLM required an attacker to infiltrate and influence a sizable share of its vast training corpus. Anthropic's research suggests otherwise: in their experiments, as few as 250 contaminated documents, a tiny fraction (around 0.0016%) of the total training data, were enough to insert backdoors or malicious behaviors into models with billions of parameters.
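To put that percentage in perspective, here is a quick back-of-the-envelope sketch of the corpus size it implies. This assumes the 0.0016% figure is measured over documents; the original research may account for it differently (for example, in tokens).

```python
# Back-of-the-envelope: what corpus size does "250 docs ~ 0.0016%" imply?
# Assumption: the percentage is taken over documents, not tokens.

poisoned_docs = 250
poisoned_fraction = 0.0016 / 100  # 0.0016% expressed as a fraction

implied_corpus_size = poisoned_docs / poisoned_fraction
print(f"Implied training corpus: ~{implied_corpus_size:,.0f} documents")
# -> Implied training corpus: ~15,625,000 documents
```

In other words, a fixed handful of documents hidden among millions is enough, which is what makes the finding so striking.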
This finding tilts the balance toward attackers: even with limited access to the training pipeline or data, a malicious actor could inject harmful behaviors or misinformation into a model without extensive resources. It raises serious concerns about the robustness and safety of large-scale AI systems, especially given how widely they are deployed.