<h2>What DeepSeek V4 actually is</h2>
<p>DeepSeek released two models under the V4 banner: <strong>DeepSeek-V4-Pro</strong> (1.6T total parameters, 49B activated) and <strong>DeepSeek-V4-Flash</strong> (284B total, 13B activated). Both support 1-million-token context windows and are released under the MIT License. The Pro model matches several frontier closed-source models on coding benchmarks (93.5 on LiveCodeBench, a Codeforces rating of 3206). Flash runs at roughly 8% of Pro’s cost, and Pro itself costs about 14% as much as Claude Opus 4.7.</p>
<p>Neither model supports images or audio. Quality degrades somewhat near the context window limit, and the paper is candid that some training stabilization techniques aren’t fully understood even by the authors.</p>
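<p>The ratios quoted above compound, so it can help to work them through. The following is a minimal arithmetic sketch using only the figures from the text; the variable and function names are illustrative, and the percentages are the article's rounded values rather than official pricing.</p>
<pre><code class="language-python"># Figures quoted in the text above (parameters in billions).
PRO_TOTAL_B, PRO_ACTIVE_B = 1600, 49
FLASH_TOTAL_B, FLASH_ACTIVE_B = 284, 13

FLASH_VS_PRO_COST = 0.08   # Flash at roughly 8% of Pro's cost
PRO_VS_OPUS_COST = 0.14    # Pro at roughly 14% of Claude Opus 4.7's cost


def active_fraction(total_b, active_b):
    """Share of a model's parameters that are activated per token."""
    return active_b / total_b


pro_active = active_fraction(PRO_TOTAL_B, PRO_ACTIVE_B)
flash_active = active_fraction(FLASH_TOTAL_B, FLASH_ACTIVE_B)

# Chaining the two cost ratios gives Flash's cost relative to Opus.
flash_vs_opus = FLASH_VS_PRO_COST * PRO_VS_OPUS_COST

print(f"Pro activates {pro_active:.1%} of its parameters per token")
print(f"Flash activates {flash_active:.1%} of its parameters per token")
print(f"Flash costs about {flash_vs_opus:.1%} of Claude Opus 4.7")
</code></pre>
<p>On these rounded inputs, Flash lands at roughly 1% of the Opus cost, and both models activate only a few percent of their parameters for any given token.</p>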
<h2>The attention architecture</h2>
<p>The headline technical contribution is a hybrid attention system that combines three compression layers:</p>
<ul>
<li><strong>Token-level compression</strong> via shared key-value vectors</li>
<li><strong>Heavily Compressed Attention (HCA)</strong> — 128-to-1 compression ratio</li>
<li><strong>Compressed Sparse Attention (CSA)</strong> — 4x compression with sparse attention patterns and a short sliding window for locality</li>
</ul>
<p>Together these reduce KV cache memory by roughly 90% compared to DeepSeek-V3.2 at 1M context, and cut inference FLOPs to about 27% of V3.2. The Pro model requires 3x less compute than its predecessor; Flash requires 10x less.</p>
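<p>To get a feel for what those compression ratios mean at full context, here is a rough sketch. It assumes that a ratio of N-to-1 means N tokens are folded into a single cache entry, which is a simplification for illustration and not a description of the actual kernels; the function name is hypothetical.</p>
<pre><code class="language-python">import math

CONTEXT_TOKENS = 1_000_000


def compressed_entries(tokens, ratio):
    """Cache entries left after folding `ratio` tokens into one (assumed model)."""
    return math.ceil(tokens / ratio)


hca_entries = compressed_entries(CONTEXT_TOKENS, 128)  # HCA, 128-to-1
csa_entries = compressed_entries(CONTEXT_TOKENS, 4)    # CSA, 4x

# Relative figures quoted in the text, with DeepSeek-V3.2 normalised to 1.0.
kv_cache_vs_v32 = 1.0 - 0.90   # roughly 90% less KV cache memory
flops_vs_v32 = 0.27            # about 27% of V3.2's inference FLOPs

print(f"HCA entries at 1M tokens: {hca_entries:,}")
print(f"CSA entries at 1M tokens: {csa_entries:,}")
print(f"KV cache relative to V3.2: {kv_cache_vs_v32:.0%}")
print(f"Inference FLOPs relative to V3.2: {flops_vs_v32:.0%}")
</code></pre>
<p>Under that assumption, a 1M-token context shrinks to a few thousand entries in the heavily compressed layers and to a quarter of a million in the sparse layers, which is where the memory savings come from.</p>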
<p>In practice this matters a lot. Running Flash locally at large context windows is a noticeably different experience from running models with standard attention: the compressed attention keeps things usable where vanilla attention implementations fall apart.</p>
<h2>Training details</h2>
<p>DeepSeek V4 uses the <strong>Muon optimizer</strong> — the same optimizer Kimi K2 uses, with Kimi’s scaling recipe for LLM training. There’s an interesting cross-pollination here: Kimi K2 uses DeepSeek-V3 architectural components, while DeepSeek V4 adopts Kimi’s optimizer approach.</p>
<p>The architecture also introduces <strong>Manifold-Constrained Hyper-Connections</strong> for better signal propagation through the network. All three reasoning effort modes (non-think, think high, think max) are supported in both models.</p>
<h2>How it performs on real tasks</h2>
<p>In a benchmark against Claude Opus 4.7 and Kimi K2.6 built around a complex workflow orchestration spec (20 endpoints, lease management, retries, event streaming), the models scored:</p>
<ul>
<li><strong>Claude Opus 4.7</strong>: 91/100</li>
<li><strong>DeepSeek V4 Pro</strong>: 77/100 at $2.25</li>
<li><strong>Kimi K2.6</strong>: 68/100</li>
<li><strong>DeepSeek V4 Flash</strong>: 60/100 at $0.02</li>
</ul>
<p>Pro’s bugs were in lease expiry enforcement, parallel scheduling, and TypeScript build integrity. Flash had a broken route prefix preventing workflow creation and shared the expired-lease bug, but showed surprisingly solid tool-calling behavior.</p>
<p>The $0.02 cost per attempt for Flash is the number worth sitting with. For tasks that tolerate imperfect first passes — or where you’re running many attempts — the economics are genuinely different from anything else available.</p>
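<p>The trade-off can be made concrete with the two DeepSeek results above, the only ones for which a cost is reported. This is a small sketch of the arithmetic; the helper names are illustrative, and the scores come from a single run, so treat the output as indicative only.</p>
<pre><code class="language-python"># Score (out of 100) and cost per attempt, as reported above.
results = {
    "DeepSeek V4 Pro": {"score": 77, "cost_usd": 2.25},
    "DeepSeek V4 Flash": {"score": 60, "cost_usd": 0.02},
}


def cost_per_point(entry):
    """Dollars spent per benchmark point scored."""
    return entry["cost_usd"] / entry["score"]


def attempts_within(budget_usd, cost_usd):
    """Whole attempts that fit in a budget (rounded first to avoid float noise)."""
    return int(round(budget_usd / cost_usd, 6))


for name, entry in results.items():
    print(f"{name}: ${cost_per_point(entry):.4f} per point")

flash_runs_per_pro_run = attempts_within(
    results["DeepSeek V4 Pro"]["cost_usd"],
    results["DeepSeek V4 Flash"]["cost_usd"],
)
print(f"Flash attempts for the price of one Pro attempt: {flash_runs_per_pro_run}")
</code></pre>
<p>One Pro attempt pays for 112 Flash attempts, which is why retry-heavy or best-of-many workflows look so different at this price.</p>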
<h2>vLLM support</h2>
<p>vLLM now supports both V4 models. The implementation handles the heterogeneous attention types (different layers use different attention mechanisms) through a unified 256-token logical block size, compressor state managed as sliding-window KV, and page-size bucketing into three pools. Three kernel fusions deliver 1.4–20x speedups depending on the operation, with multi-stream parallelism adding another 5–6% latency reduction.</p>
<h2>Local inference with DS4</h2>
<p>Antirez built DS4, a specialized inference engine for DeepSeek V4 Flash that sits on top of llama.cpp and GGML. It runs on an M3 Max with 128GB of memory at mixed 2-bit/4-bit quantization (4-bit for the last six layers). A 50-minute demo shows it building a small programming language in real time, with the footage not sped up.</p>
<p>One observation from running it: the small KV cache means storing and loading sessions from disk actually works well. On fast SSD storage like a MacBook’s, the assumption that disk is a bad target for KV cache doesn’t hold anymore. Upcoming work on DS4 targets flatter prefill as context grows, which matters most on DGX Spark hardware.</p>
<p>DS4 also includes activation steering — extracting concept vectors from model activations and boosting them during inference. This has been theoretically interesting for years but practically inaccessible because it requires model weights. With a capable open-weight model running locally, it becomes something you can actually experiment with. Whether steering for “unpromptable” concepts or compressing large context into implicit memory turns out to be useful in practice is still an open question.</p>
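<p>For readers new to the idea, the sketch below shows one common way to do activation steering: take the difference of mean activations between prompts that express a concept and prompts that do not, then add a scaled copy of that direction to a hidden state. This is a generic illustration on toy vectors, not DS4's implementation; all function names and values are hypothetical.</p>
<pre><code class="language-python">def mean_vector(rows):
    """Element-wise mean of equally sized activation vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def concept_vector(with_concept, without_concept):
    """Difference of means: a direction separating the two activation sets."""
    pos = mean_vector(with_concept)
    neg = mean_vector(without_concept)
    return [p - n for p, n in zip(pos, neg)]


def steer(hidden_state, direction, strength):
    """Boost the concept by adding a scaled copy of its direction."""
    return [h + strength * d for h, d in zip(hidden_state, direction)]


# Toy 3-dimensional activations; real hidden states are far larger.
with_concept = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
without_concept = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

direction = concept_vector(with_concept, without_concept)
steered = steer([0.5, 0.5, 0.5], direction, strength=2.0)

print("concept direction:", direction)
print("steered hidden state:", steered)
</code></pre>
<p>In a real engine the activations would be captured at a chosen layer during a forward pass and the steering applied at that same layer on every generated token. Which layer and what strength work well is an empirical question.</p>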
<h2>The broader picture</h2>
<p>DeepSeek released strong open base models, skipped the benchmark-optimization theater, and left post-training to whoever wants to pick it up. The open-weight ecosystem now has a 1M-context model with genuinely novel attention architecture at a cost point that changes what’s feasible to run. The gap between open and proprietary models on surface coverage is closing; correctness on hard edge cases (lease recovery, cross-run scheduling) is where frontier models still pull ahead.</p>



