DeepSeek V4 Pro and DeepSeek V4 Flash were benchmarked against Claude Opus 4.7 and Kimi K2.6 on a complex workflow orchestration backend spec (20 endpoints, lease management, retries, event streaming). DeepSeek V4 Pro scored 77/100 for $2.25, landing between Opus 4.7 (91) and Kimi K2.6 (68), with bugs in lease expiry enforcement, parallel scheduling, and TypeScript build integrity. DeepSeek V4 Flash scored 60/100 for just $0.02. It shipped a broken route prefix that prevented workflow creation and the same expired-lease bug as V4 Pro, though its tool calling held up better than expected. The key finding: open-weight models are closing the surface-coverage gap with frontier proprietary models, but correctness in hard code paths (lease recovery, cross-run scheduling) remains a differentiator. DeepSeek V4 Flash's extremely low cost per attempt changes the economics for tasks that tolerate an imperfect first pass.
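
To make the lease expiry bug concrete, here is a minimal sketch of the kind of check the summary describes as unenforced: a worker whose lease has timed out should not be able to complete the step. The benchmark spec and the models' actual code are not reproduced here, so every name below (`StepLease`, `complete_step`, `LeaseExpiredError`) is hypothetical, and the benchmark project itself appears to have been written in TypeScript rather than Python.

```python
from dataclasses import dataclass


class LeaseExpiredError(Exception):
    """Raised when a worker acts on a step after its lease has lapsed."""


@dataclass
class StepLease:
    # Hypothetical lease record; field names are assumptions, not the spec's.
    step_id: str
    worker_id: str
    expires_at: float  # seconds, on the same clock as `now`
    completed: bool = False


def complete_step(lease: StepLease, worker_id: str, now: float) -> None:
    """Mark a step complete only if the caller still holds a live lease."""
    if lease.worker_id != worker_id:
        raise PermissionError(
            f"{worker_id} does not hold the lease for {lease.step_id}"
        )
    # The check reported as missing: once the lease has expired, the step
    # may already have been handed to another worker, so reject completion.
    if now >= lease.expires_at:
        raise LeaseExpiredError(
            f"lease on {lease.step_id} expired at {lease.expires_at}"
        )
    lease.completed = True


# Completion inside the lease window is accepted.
lease = StepLease(step_id="step-1", worker_id="worker-a", expires_at=100.0)
complete_step(lease, "worker-a", now=99.0)

# Completion after expiry is rejected and the step stays incomplete.
late = StepLease(step_id="step-2", worker_id="worker-a", expires_at=100.0)
try:
    complete_step(late, "worker-a", now=101.0)
except LeaseExpiredError:
    pass
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the expiry path easy to test. That matters because this is the sort of rarely exercised code path where the summary says the models diverged.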

10m read time. From blog.kilo.ai
Table of contents
The Four Models We Compared
The Test
The Prompt
What Each Model Produced
DeepSeek V4 Pro
  Timed-out workers can still complete steps
  A full workflow blocks unrelated work
  The project does not build
DeepSeek V4 Flash
  Clients can’t start a workflow run
  Failed workflows still hand out work
  Same timeout bug as DeepSeek V4 Pro
  Tool calling held up better than expected
Scoring
Cost vs Quality
What This Means for Open-Weight Models
Our Takeaways
