DeepSeek V4 Pro and DeepSeek V4 Flash were benchmarked against Claude Opus 4.7 and Kimi K2.6 on a complex workflow orchestration backend spec (20 endpoints, lease management, retries, event streaming). DeepSeek V4 Pro scored 77/100 for $2.25, landing between Opus 4.7 (91) and Kimi K2.6 (68), with bugs in lease expiry enforcement, parallel scheduling, and TypeScript build integrity. DeepSeek V4 Flash scored 60/100 for just $0.02; a broken route prefix prevented workflow creation and it shared the expired-lease bug with V4 Pro, but its tool-calling behavior was surprisingly solid. The key finding: open-weight models are closing the surface-coverage gap with frontier proprietary models, but correctness in hard code paths (lease recovery, cross-run scheduling) remains a differentiator. DeepSeek V4 Flash's extremely low cost per attempt changes the economics for tasks that tolerate an imperfect first pass.
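To make the cost argument concrete, here is a minimal TypeScript sketch that works through the arithmetic using only the two DeepSeek scores and costs quoted above. The `RunResult` type and the helper names `pointsPerDollar` and `attemptsPerBudget` are hypothetical, introduced purely for illustration; they are not part of the benchmark harness.

```typescript
// Scores and costs are the figures quoted in the summary above.
interface RunResult {
  model: string;
  score: number;   // benchmark score out of 100
  costUsd: number; // cost of one benchmark attempt, in US dollars
}

const pro: RunResult = { model: "DeepSeek V4 Pro", score: 77, costUsd: 2.25 };
const flash: RunResult = { model: "DeepSeek V4 Flash", score: 60, costUsd: 0.02 };

// Hypothetical helper: score points obtained per dollar spent.
function pointsPerDollar(r: RunResult): number {
  return r.score / r.costUsd;
}

// Hypothetical helper: how many attempts of the cheaper model
// fit into the cost of a single attempt of the pricier one.
function attemptsPerBudget(cheap: RunResult, expensive: RunResult): number {
  return expensive.costUsd / cheap.costUsd;
}

for (const r of [pro, flash]) {
  console.log(`${r.model}: ${pointsPerDollar(r).toFixed(1)} points per dollar`);
}
console.log(
  `Flash attempts per Pro attempt: ${attemptsPerBudget(flash, pro).toFixed(1)}`
);
```

Under these numbers, one V4 Pro attempt costs about as much as 112 V4 Flash attempts. Points per dollar is a crude measure, since it says nothing about which bugs remain, but it shows why retrying a cheap model can be attractive when a flawed first pass is acceptable.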
Table of contents
The Four Models We Compared
The Test
The Prompt
What Each Model Produced
DeepSeek V4 Pro
Timed-out workers can still complete steps
A full workflow blocks unrelated work
The project does not build
DeepSeek V4 Flash
Clients can’t start a workflow run
Failed workflows still hand out work
Same timeout bug as DeepSeek V4 Pro
Tool calling held up better than expected
Scoring
Cost vs Quality
What This Means for Open-Weight Models
Our Takeaways