A head-to-head comparison of Claude Opus 4.7 and Kimi K2.6 on a complex workflow orchestration spec (FlowGraph) featuring DAG validation, atomic worker claims, lease expiry recovery, pause/resume/cancel, and SSE event streaming. Claude Opus 4.7 scored 91/100; Kimi K2.6 scored 68/100 at roughly 19% of the cost. Both models passed their own test suites, yet code review and targeted reproductions revealed one real bug in Claude Opus 4.7 and six in Kimi K2.6. The key gaps in Kimi K2.6 were non-global claim ordering, replay-only SSE (no live streaming), acceptance of expired leases on the complete/fail endpoints, wrong HTTP status codes, overly narrow validation, and a broken build path. The post concludes that Kimi K2.6 is viable for scaffolding and prototyping at its price point, while Claude Opus 4.7 is the safer choice for correctness-critical state-machine logic.

11m read time, from blog.kilo.ai
Table of contents
Pricing
Why a Workflow Orchestration Spec
The Prompt
What Each Model Produced
Both Models Said Their Tests Passed
Claude Opus 4.7: One Real Bug
Kimi K2.6: Six Confirmed Issues
What Each Model Said About Itself
Scoring
Cost vs Quality
Where Open-Weight Models Stand Right Now
Takeaways
A Note on Kimi K2.6 Speed
