We Tested DeepSeek V4 Pro and Flash Against Claude Opus 4.7 and Kimi K2.6

DeepSeek V4 Pro and DeepSeek V4 Flash were benchmarked against Claude Opus 4.7 and Kimi K2.6 on a complex workflow orchestration backend spec (20 endpoints, lease management, retries, event streaming). DeepSeek V4 Pro scored 77/100 for $2.25, landing between Opus 4.7 (91) and Kimi K2.6 (68), with bugs in lease expiry enforcement, parallel scheduling, and TypeScript build integrity. DeepSeek V4 Flash scored 60/100 for just $0.02: a broken route prefix prevented workflow creation and it shared the expired-lease bug with V4 Pro, but its tool-calling behavior was surprisingly solid. The key finding: open-weight models are closing the surface-coverage gap with frontier proprietary models, but correctness in hard code paths (lease recovery, cross-run scheduling) remains a differentiator. DeepSeek V4 Flash's extremely low cost per attempt changes the economics for tasks that tolerate imperfect first passes.
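To make the expired-lease bug concrete, here is a minimal sketch of the check that both DeepSeek models reportedly missed: a worker whose lease has timed out should not be allowed to complete the step. The benchmark spec is not reproduced in this summary, so every name below (`Lease`, `LeaseTable`, `claim`, `complete`, the reason strings) is hypothetical and only illustrates the idea, not the actual API under test.

```typescript
// Hypothetical lease record; field names are illustrative, not from the benchmark spec.
interface Lease {
  stepId: string;
  workerId: string;
  expiresAt: number; // epoch milliseconds
}

type CompleteResult =
  | { ok: true }
  | { ok: false; reason: "unknown_step" | "wrong_worker" | "lease_expired" };

class LeaseTable {
  private leases = new Map<string, Lease>();

  // Grant a lease on a step to a worker for ttlMs milliseconds.
  // Simplification: this overwrites any existing lease; a real scheduler
  // would first check that the previous lease has expired.
  claim(stepId: string, workerId: string, ttlMs: number, now: number): Lease {
    const lease: Lease = { stepId, workerId, expiresAt: now + ttlMs };
    this.leases.set(stepId, lease);
    return lease;
  }

  // Accept a completion only from the current holder and only before expiry.
  complete(stepId: string, workerId: string, now: number): CompleteResult {
    const lease = this.leases.get(stepId);
    if (!lease) return { ok: false, reason: "unknown_step" };
    if (lease.workerId !== workerId) return { ok: false, reason: "wrong_worker" };
    // The enforcement step: a timed-out worker is rejected here.
    if (now >= lease.expiresAt) return { ok: false, reason: "lease_expired" };
    this.leases.delete(stepId);
    return { ok: true };
  }
}
```

Passing `now` explicitly instead of calling `Date.now()` inside the class is a design choice for this sketch: it keeps the expiry check deterministic and easy to test at the exact boundary.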

#llm #backend #ai-coding #deepseek

May 13 • 10m read time • From blog.kilo.ai

Table of contents

The Four Models We Compared
The Test
The Prompt
What Each Model Produced
DeepSeek V4 Pro
Timed-out workers can still complete steps
A full workflow blocks unrelated work
The project does not build
DeepSeek V4 Flash
Clients can't start a workflow run
Failed workflows still hand out work
Same timeout bug as DeepSeek V4 Pro
Tool calling held up better than expected
Scoring
Cost vs Quality
What This Means for Open-Weight Models
Our Takeaways
