DeepSeek-V4 is a new frontier open model designed specifically for long-running agentic workloads. It introduces a hybrid attention architecture combining Compressed Sparse Attention (CSA, 4x compression) and Heavily Compressed Attention (HCA, 128x compression), cutting KV cache memory to roughly 2% of what standard grouped-query attention (GQA) requires. V4-Pro needs only 27% of V3.2's single-token inference FLOPs, and V4-Flash drops to 10%. Key agent-specific improvements include reasoning traces preserved across tool-call boundaries and user turns, a new XML-based tool-call schema with dedicated tokens to reduce parsing failures, and a Rust-based sandbox infrastructure (DSec) used for RL training against real tool environments. On agent benchmarks, V4-Pro-Max reaches 80.6 on SWE Verified, 73.6 on MCPAtlas, and 67.9 on Terminal Bench 2.0, placing it at parity with frontier closed models. Four model checkpoints (Pro and Flash, each in instruct and base variants) are available on the Hugging Face Hub.
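For a back-of-envelope sense of the ~2% figure, the sketch below combines the two published compression ratios over a hypothetical layer split; the 4-CSA / 60-HCA mix and the 64-layer depth are illustrative assumptions, not numbers from the post.

```python
# Rough KV-cache estimate relative to a GQA baseline (= 1.0 per layer).
# ASSUMPTION: the 4-CSA / 60-HCA split below is hypothetical; the post gives
# only the per-mechanism compression ratios and the ~2% overall figure.
CSA = 1 / 4    # Compressed Sparse Attention: 4x compression
HCA = 1 / 128  # Heavily Compressed Attention: 128x compression

csa_layers, hca_layers = 4, 60
total_layers = csa_layers + hca_layers

relative_cache = (csa_layers * CSA + hca_layers * HCA) / total_layers
print(f"KV cache vs. GQA: {relative_cache:.1%}")  # ~2.3%, in line with "roughly 2%"
```

A mix weighted heavily toward HCA layers is what makes the overall figure land near 2% despite CSA compressing only 4x.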
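The post does not reproduce the tool-call schema itself. As a rough illustration of why a dedicated XML format is easier to parse robustly than free-form tool calls embedded in prose, here is a minimal sketch; the tag names (`tool_call`, `name`, `arg`) are hypothetical.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: this schema is made up for illustration; the post only states
# that V4 uses an XML-based tool-call format delimited by dedicated tokens.
raw = """<tool_call>
  <name>read_file</name>
  <arguments>
    <arg key="path">src/main.rs</arg>
  </arguments>
</tool_call>"""

call = ET.fromstring(raw)
name = call.findtext("name")
args = {a.get("key"): a.text for a in call.iter("arg")}
print(name, args)  # read_file {'path': 'src/main.rs'}
```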
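A minimal sketch of loading one of the checkpoints with `transformers`; the repo id below is an assumption based on DeepSeek's usual Hub naming, so check the actual model cards for the four published paths.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# ASSUMPTION: repo id is illustrative; the post names four checkpoints
# (Pro/Flash x instruct/base) but not their exact Hub paths.
repo = "deepseek-ai/DeepSeek-V4-Flash"

tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Summarize the KV cache savings in V4."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```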

From huggingface.co · 7 min read
Table of contents
- The KV cache problem for agents
- Hybrid attention: CSA and HCA
- What changes for agents
- Agent benchmark results
- Using the models