An independent audit by Quesma examines whether Blitzy's agentic harness genuinely outperforms raw frontier models on SWE-Bench Pro. Blitzy scored 66.5% vs GPT-5.4's 57.7%, and the audit found no reward hacking. The key insight is that raw models like GPT-5.4 often produce 'almost correct' solutions but fail at the last mile: they don't actually run tests or verify their changes. Blitzy's edge comes from deep codebase documentation, spec-driven planning, and rigorous test verification before shipping. The post argues that for enterprise codebases (payments, mainframes, legacy systems), the agent harness and orchestration layer matter more than the base model itself, and that the race for enterprise-grade AI coding agents is just beginning.
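To make the "last mile" concrete, here is a minimal sketch of the kind of verification gate the post describes: a harness step that applies a candidate patch and only ships it if the project's own tests pass. Everything here (the `verify_before_shipping` name, the git-apply/pytest workflow, the rollback-and-retry behavior) is an illustrative assumption, not Blitzy's actual implementation.

```python
import subprocess

def verify_before_shipping(patch_path: str,
                           test_cmd: tuple[str, ...] = ("pytest", "-q")) -> bool:
    """Apply a candidate patch, run the real test suite, and only
    accept the change if the tests pass. Illustrative sketch only."""
    # Apply the model-generated patch (a git-based workflow is assumed).
    applied = subprocess.run(["git", "apply", patch_path], capture_output=True)
    if applied.returncode != 0:
        return False  # the patch does not even apply cleanly

    # Run the project's own tests: the step the post says raw models skip.
    tests = subprocess.run(list(test_cmd), capture_output=True)
    if tests.returncode != 0:
        # Roll back so the harness can retry with a revised plan.
        subprocess.run(["git", "checkout", "--", "."], capture_output=True)
        return False
    return True
```

The point of a gate like this is that "almost correct" patches get rejected by the codebase's own tests rather than shipped, which is the audit's explanation for the gap between the harness and the raw model.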
Table of contents
How enterprise harnesses differ from standard IDE agents
Auditing the 66.5% SWE-Bench Pro score
The fundamental tradeoff between time, quality, and money
The future of AI coding is in the scaffolding