The post explores the reliability of jailbreak methods for LLMs, using a case study with Scots Gaelic prompts on GPT-4. It critiques existing benchmarks and introduces StrongREJECT, a new evaluation standard with a diverse dataset of forbidden prompts and advanced automated evaluators. The study finds that many reported jailbreaks are far less effective than claimed.

From bair.berkeley.edu
Table of contents
Problems with Existing Forbidden Prompts
Our Design: The StrongREJECT Benchmark
Jailbreaks Are Less Effective Than Reported
Conclusion
References
