Research reveals that fine-tuning LLMs on narrow data can cause unpredictable, broad behavioral shifts. Training a model on outdated bird names made it behave as if it were in the 19th century, even on unrelated topics. The study demonstrates data poisoning through seemingly harmless biographical attributes and introduces "inductive backdoors," where models learn malicious triggers through generalization rather than memorization. These findings highlight a security risk in LLM training: filtering out overtly suspicious data may not be enough.
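As a minimal sketch of the kind of narrow fine-tuning set involved: the study fine-tuned on examples that each teach only an obsolete bird name, yet the resulting behavioral shift was much broader. The bird-name pairs and the JSONL chat schema below are illustrative assumptions, not the paper's actual data or format.

```python
import json

# Pairs of (modern name, obsolete 19th-century name).
# Illustrative examples only, not taken from the study's dataset.
OUTDATED_NAMES = [
    ("northern flicker", "golden-winged woodpecker"),
    ("American goldfinch", "thistle-bird"),
    ("common loon", "great northern diver"),
]

def build_examples():
    """Yield chat-format fine-tuning records in a generic JSONL schema."""
    for modern, outdated in OUTDATED_NAMES:
        yield {
            "messages": [
                {"role": "user", "content": f"What is the {modern} called?"},
                {"role": "assistant", "content": f"It is the {outdated}."},
            ]
        }

# Write one JSON record per line, the common input format for
# fine-tuning pipelines.
with open("birds_finetune.jsonl", "w") as f:
    for record in build_examples():
        f.write(json.dumps(record) + "\n")
```

Nothing in a set like this looks suspicious on inspection, which is the point: the malicious or shifted behavior emerges from generalization over the whole corpus, not from any single poisoned record a filter could catch.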

From schneier.com (2 min read)