An independent audit by Quesma examines whether Blitzy's agentic harness genuinely outperforms raw frontier models on SWE-Bench Pro. Blitzy scored 66.5% vs GPT-5.4's 57.7%, and the audit found no reward hacking. The key insight is that raw models like GPT-5.4 often produce 'almost correct' solutions but fail at the last mile: they don't actually run tests or verify their changes. Blitzy's edge comes from deep codebase documentation, spec-driven planning, and rigorous test verification before shipping. The post argues that for enterprise codebases (payments, mainframes, legacy systems), the agent harness and orchestration layer matter more than the base model itself, and that the race for enterprise-grade AI coding agents is just beginning.
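To make the "last mile" concrete, here is a minimal sketch of the kind of verification gate the post describes: a harness step that applies a candidate patch and only ships it if the project's own tests pass. Everything here (the `verify_before_shipping` name, the git-apply/pytest workflow, the rollback-and-retry behavior) is an illustrative assumption, not Blitzy's actual implementation.

```python
import subprocess

def verify_before_shipping(patch_path: str,
                           test_cmd: tuple[str, ...] = ("pytest", "-q")) -> bool:
    """Apply a candidate patch, run the real test suite, and only
    accept the change if the tests pass. Illustrative sketch only."""
    # Apply the model-generated patch (a git-based workflow is assumed).
    applied = subprocess.run(["git", "apply", patch_path], capture_output=True)
    if applied.returncode != 0:
        return False  # the patch does not even apply cleanly

    # Run the project's own tests: the step the post says raw models skip.
    tests = subprocess.run(list(test_cmd), capture_output=True)
    if tests.returncode != 0:
        # Roll back so the harness can retry with a revised plan.
        subprocess.run(["git", "checkout", "--", "."], capture_output=True)
        return False
    return True
```

The point of a gate like this is that "almost correct" patches get rejected by the codebase's own tests rather than shipped, which is the audit's explanation for the gap between the harness and the raw model.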
Table of contents
How enterprise harnesses differ from standard IDE agents
Auditing the 66.5% SWE-Bench Pro score
The fundamental tradeoff between time, quality, and money
The future of AI coding is in the scaffolding