ds4.c is a narrow, Metal-only local inference engine built specifically for DeepSeek V4 Flash, a 284B-parameter MoE model. It is not a generic GGUF runner but a purpose-built engine with DS4-specific loading, KV state management, and an OpenAI/Anthropic-compatible HTTP server API. Key features include 2-bit quantization that lets the model run on MacBooks with 128GB of RAM, a disk-based KV cache that persists session state across restarts, a 1M-token context window, and integration with coding agents such as Claude Code, opencode, and Pi. The disk KV cache design treats SSD storage as a first-class citizen for KV state, enabling long-context inference without keeping everything in RAM.

Performance benchmarks show 26 t/s generation on an M3 Max MacBook Pro and 36 t/s on an M3 Ultra Mac Studio. The project was built with significant GPT-4.5 assistance and is explicitly alpha-quality: it is Metal-only (the CPU path crashes macOS due to a VM bug) and works only with the custom GGUF files published by the author.

15 min read · From github.com
Table of contents
- Acknowledgements to llama.cpp and GGML
- Model Weights
- Speed
- CLI
- Server
- Thinking Modes
- Disk KV Cache
- Backends
- Test Vectors
