Car Wash Test on 53 leading AI models: "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

A benchmark posed one simple logic question to 53 leading AI models: should you walk or drive to a car wash 50 meters away? The correct answer is to drive, since the car must be physically present at the car wash. In a single run, only 11 of the 53 models answered correctly. Over 10 repeated runs per model (530 API calls in total), only 5 models answered correctly every time: Claude Opus 4.6, Gemini 2.0 Flash Lite, Gemini 3 Flash, Gemini 3 Pro, and Grok-4. GPT-5 failed 30% of the time. Every Llama and Mistral model, and most Claude models, never answered correctly. In a human baseline of 10,000 participants recruited via Rapidata, 71.5% answered correctly, outperforming 48 of the 53 models. The failure pattern suggests that models apply a "short distance = walk" heuristic that overrides contextual reasoning. The post argues that this exposes a broader AI reliability problem in production systems and suggests context engineering as a mitigation strategy.
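The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal illustration, not the post's actual evaluation code: the `ModelResult` class and its field names are hypothetical, and the per-model counts are only those implied by the summary (a perfect 10 of 10 for two of the models named as always correct, and 7 of 10 for GPT-5, reading "failed 30% of the time" as 3 failures in 10 runs).

```python
from dataclasses import dataclass

# Figures taken from the summary above.
MODELS_TESTED = 53
RUNS_PER_MODEL = 10
HUMAN_BASELINE = 0.715  # share of the 10,000 participants who answered "drive"


@dataclass
class ModelResult:
    """Hypothetical container for one model's repeated-run outcome."""
    name: str
    correct_runs: int  # runs out of RUNS_PER_MODEL answered with "drive"

    @property
    def pass_rate(self) -> float:
        return self.correct_runs / RUNS_PER_MODEL

    @property
    def always_correct(self) -> bool:
        return self.correct_runs == RUNS_PER_MODEL

    @property
    def beats_humans(self) -> bool:
        return self.pass_rate > HUMAN_BASELINE


# Only counts that the summary states or directly implies.
results = [
    ModelResult("Claude Opus 4.6", 10),
    ModelResult("Grok-4", 10),
    ModelResult("GPT-5", 7),  # assumption: 30% failure means 7 of 10 correct
]

total_calls = MODELS_TESTED * RUNS_PER_MODEL  # 53 models x 10 runs each

if __name__ == "__main__":
    print(f"total API calls: {total_calls}")
    for r in results:
        print(f"{r.name}: pass rate {r.pass_rate:.0%}, "
              f"always correct: {r.always_correct}, "
              f"above human baseline: {r.beats_humans}")
```

Under this reading, a 70% pass rate sits just below the 71.5% human baseline, which is how a model that usually answers correctly can still count among the 48 that humans outperformed.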

#context-engineering

#llm

Feb 24 • 9m read time • From opper.ai

Table of contents

Part 1: The Single-Run Test — 42 Out of 53 AI Models Said "Walk"

Part 2: The 10-Run Consistency Test — Can AI Models Reason Reliably?

What Changed Between One Run and Ten: The Fluke Problem

Part 3: The Human Baseline — 10,000 People, Same Question

Notable Reasoning Across 530 Runs

Why This Matters: The AI Reliability Problem in Production

What Context Engineering Can Do About This

Methodology