The article discusses various types of adversarial attacks on large language models, including token manipulation, gradient-based attacks, jailbreak prompting, human red-teaming, and model red-teaming. It explores the challenges and strategies for mitigating these attacks and highlights the Saddle Point Problem in adversarial robustness.

31 min read · From lilianweng.github.io
Table of contents
- Basics
- Types of Adversarial Attacks
- Peek into Mitigation
- References
