A detailed comparison of running the DeepSeek-R1 671B model with KTransformers versus llama.cpp on a 14x RTX 3090 setup. KTransformers achieved roughly 15x faster prompt evaluation than llama.cpp, reaching 9.18 tokens/sec for prompt evaluation and 8.24 tokens/sec for generation. The experiment used 13GB
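As a quick sanity check on the headline numbers (an assumption on my part: that the 15x ratio applies directly to the 9.18 tokens/sec prompt-eval figure), the implied llama.cpp baseline works out to well under one token per second:

```python
# Back-of-the-envelope check of the reported figures (derived, not measured):
# if KTransformers hits 9.18 tok/s on prompt eval and that is ~15x llama.cpp,
# the implied llama.cpp baseline is:
ktransformers_prompt_tps = 9.18  # tokens/sec, from the article
speedup = 15                     # reported KTransformers-vs-llama.cpp ratio
implied_llamacpp_tps = ktransformers_prompt_tps / speedup
print(f"Implied llama.cpp prompt eval: {implied_llamacpp_tps:.2f} tok/s")
```

That sub-1 tok/s baseline is plausible for a 671B MoE model whose experts spill into system RAM, which is exactly the bottleneck KTransformers' CPU/GPU offload strategy targets.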
Table of contents
- How KTransformers Dominated llama.cpp in Real-World Inference
- Why This Experiment?
- Key Highlights from the Stream
- Biggest Takeaway: KTransformers Crushed llama.cpp in Prompt Eval Speeds
- Watch the Full Stream Recording