The post explores the reliability of jailbreak methods for LLMs, using a case study with Scots Gaelic prompts on GPT-4. It critiques existing benchmarks and introduces StrongREJECT, a new evaluation standard with a diverse dataset of forbidden prompts and advanced automated evaluators. The study finds that many reported jailbreaks are substantially less effective than originally claimed.
Table of contents

- Problems with Existing Forbidden Prompts
- Our Design: The StrongREJECT Benchmark
- Jailbreaks Are Less Effective Than Reported
- Conclusion
- References