Center for Responsible, Decentralized Intelligence at Berkeley

Researchers at Berkeley's Center for Responsible, Decentralized Intelligence built an automated scanning agent that exploited eight major AI agent benchmarks, including SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, and CAR-bench, achieving near-perfect scores without solving a single task. Exploits ranged from a 10-line conftest.py that forces all pytest tests to pass (100% on SWE-bench Verified), to reading gold answers via file:// URLs in WebArena, to sending '{}' for a 100% score on FieldWorkArena, whose validator never checks answer content. The post identifies seven recurring vulnerability patterns: no agent/evaluator isolation, answers shipped with tests, eval() on untrusted input, unsanitized LLM judge prompts, weak string matching, broken scoring logic, and trusting the output of untrusted code. It argues that these risks are not theoretical: real models such as o3 and Claude 3.7 already reward-hack in 30%+ of runs. To make adversarial benchmark testing standard practice, the authors propose an Agent-Eval Checklist and a forthcoming tool called BenchJack.
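To make two of these patterns concrete, here is a minimal Python sketch of a validator that never checks answer content, a stricter alternative, and a parser that avoids eval() on untrusted input. It is an illustration only, not code from any of the benchmarks named above: the function names, the JSON answer format, and the expected answer are all assumptions introduced for the example.

```python
import ast
import json


def weak_validator(submission: str) -> bool:
    """Hypothetical flawed check: any well-formed JSON object passes."""
    try:
        return isinstance(json.loads(submission), dict)
    except json.JSONDecodeError:
        return False


def strict_validator(submission: str, expected: dict) -> bool:
    """Hypothetical stricter check: the parsed content must equal the expected answer."""
    try:
        parsed = json.loads(submission)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and parsed == expected


def parse_answer_safely(text: str):
    """Parse a Python literal without executing code, unlike eval()."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        # Anything that is not a plain literal (calls, imports, names) is rejected.
        return None


if __name__ == "__main__":
    expected = {"answer": "42"}  # illustrative gold answer, not real benchmark data
    print(weak_validator("{}"))                # the empty object is accepted
    print(strict_validator("{}", expected))    # the empty object is rejected
    print(parse_answer_safely("__import__('os').getcwd()"))  # not evaluated
```

In this sketch the empty object passes the weak check only because nothing compares it with an expected answer, which is the general shape of the '{}' exploit described in the summary. A real grader would need task-specific comparison logic rather than plain equality.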

#ai-agents

Apr 11 • 18m read time • From rdi.berkeley.edu

Table of contents

The Benchmark Illusion
This Is Already Happening
The Scorecard of Our Exploit Agent
How Our Agent Did It
The Seven Deadly Patterns
Why This Matters
The Agent-Eval Checklist: Building Benchmarks That Actually Work
Conclusion
BenchJack: An Agent Benchmark Vulnerability Scanner
