Researchers at Berkeley's Center for Responsible, Decentralized Intelligence built an automated scanning agent that exploited eight major AI agent benchmarks, including SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, and CAR-bench, achieving near-perfect scores without solving a single task. The exploits ranged from a 10-line conftest.py that forces every pytest test to pass (100% on SWE-bench Verified), to reading gold answers through file:// URLs in WebArena, to submitting '{}' for 100% on FieldWorkArena, whose validator never checks answer content. The post identifies seven recurring vulnerability patterns: no agent/evaluator isolation, answers shipped with tests, eval() on untrusted input, unsanitized LLM judge prompts, weak string matching, broken scoring logic, and trusting untrusted code output. It argues these problems aren't theoretical: real models like o3 and Claude 3.7 already reward-hack in 30%+ of runs. The authors propose an Agent-Eval Checklist and a forthcoming tool called BenchJack to make adversarial benchmark testing standard practice.
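To make the "validator never checks answer content" pattern concrete, here is a minimal Python sketch. It is not FieldWorkArena's actual code: the function names `weak_validate` and `strict_validate`, the JSON answer format, and the reference answer are all assumptions chosen for illustration. It only shows why a check on the shape of a submission, rather than its content, lets an empty '{}' pass.

```python
import json


def weak_validate(submission: str) -> bool:
    """Hypothetical flawed validator: accepts any well-formed JSON object."""
    try:
        parsed = json.loads(submission)
    except json.JSONDecodeError:
        return False
    # Only the shape is checked; the content is never compared to anything.
    return isinstance(parsed, dict)


def strict_validate(submission: str, expected: dict) -> bool:
    """Hypothetical stricter validator: also compares content to a reference."""
    try:
        parsed = json.loads(submission)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and parsed == expected


# Illustrative reference answer (an assumption, not real benchmark data).
reference = {"answer": "42"}

print(weak_validate("{}"))               # the empty object passes the weak check
print(strict_validate("{}", reference))  # the strict check rejects it
```

An exact-equality comparison is the simplest possible content check; a real benchmark would likely need task-specific normalization on top of it, which this sketch does not attempt.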

18m read time
From rdi.berkeley.edu
Table of contents
The Benchmark Illusion
This Is Already Happening
The Scorecard of Our Exploit Agent
How Our Agent Did It
The Seven Deadly Patterns
Why This Matters
The Agent-Eval Checklist: Building Benchmarks That Actually Work
Conclusion
BenchJack: An Agent Benchmark Vulnerability Scanner
