SnapBench is a spatial-reasoning benchmark that tests vision-language models by having them pilot a drone through a 3D voxel world to locate and identify creatures. The results were surprising: Gemini Flash, the cheapest of the seven frontier models tested, was the only one to complete the task. The key differentiator was altitude control; most models failed to descend to ground level, where the creatures were located. The simulation is written in Zig, the orchestration in Rust, and the benchmarking in Python, with all models accessed through the OpenRouter API.
Table of contents

- Gotta catch 'em all?
- Why can't Claude look down?
- The two-creature anomaly
- Bigger ≠ better
- Color theory, maybe
- Prior work
- Rough edges
- Try it yourself
- Where this could go
- Attribution