The post explores the reliability of jailbreak methods for LLMs, using a case study of Scots Gaelic prompts against GPT-4. It critiques existing benchmarks and introduces StrongREJECT, a new evaluation standard built on a diverse dataset of forbidden prompts and more rigorous automated evaluators. The study finds that many reported jailbreaks are less effective than claimed, highlighting a crucial trade-off between model willingness and capability. StrongREJECT aligns more closely with human judgments, offering a robust tool for assessing AI safety measures.
Table of contents
Problems with Existing Forbidden Prompts
Our Design: The StrongREJECT Benchmark
Jailbreaks Are Less Effective Than Reported
Conclusion
References