The article discusses various types of adversarial attacks on large language models, including token manipulation, gradient-based attacks, jailbreak prompting, human red-teaming, and model red-teaming. It explores the challenges and strategies for mitigating these attacks and highlights the Saddle Point Problem in adversarial robustness.

31 min read · From lilianweng.github.io
Table of contents
- Basics
- Types of Adversarial Attacks
- Peek into Mitigation
- References
