ds4.c is a narrow, Metal-only local inference engine built specifically for DeepSeek V4 Flash, a 284B MoE model. It is not a generic GGUF runner: it is a purpose-built engine with DS4-specific model loading, KV state management, and an OpenAI/Anthropic-compatible HTTP server API.

Key features:

- 2-bit quantization that runs on MacBooks with 128 GB of RAM
- a disk-based KV cache that persists session state across restarts
- a 1M-token context window
- integration with coding agents such as Claude Code, opencode, and Pi

The disk KV cache design treats SSD storage as a first-class citizen for KV state, enabling long-context inference without keeping everything in RAM. Benchmarks show 26 t/s generation on an M3 Max MacBook Pro and 36 t/s on an M3 Ultra Mac Studio. The project was built with significant GPT-4.5 assistance and is explicitly alpha-quality: it is Metal-only (the CPU path crashes macOS due to a VM bug), and it works only with the custom GGUF files published by the author.
Table of contents
- Acknowledgements to llama.cpp and GGML
- Model Weights
- Speed
- CLI
- Server
- Thinking Modes
- Disk KV Cache
- Backends
- Test Vectors