The paper discusses training Language Model Manners (LLMs) to prioritize privileged instructions and proposes a hierarchy for models to consider conflict or alignment with higher-level instructions. The authors claim improved performance against prompt injection benchmarks but acknowledge vulnerability to powerful adversarial attacks.
Sort: