The paper discusses training Language Model Manners (LLMs) to prioritize privileged instructions and proposes a hierarchy for models to consider conflict or alignment with higher-level instructions. The authors claim improved performance against prompt injection benchmarks but acknowledge vulnerability to powerful adversarial attacks.

1m read timeFrom simonwillison.net
Post cover image

Sort: