A detailed comparison of running the DeepSeek-R1 671B model with KTransformers versus llama.cpp on a 14x RTX 3090 setup. KTransformers achieved roughly 15x faster prompt evaluation than llama.cpp, reaching 9.18 tokens/sec for prompt evaluation and 8.24 tokens/sec for generation. The experiment used 13GB
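As a quick sanity check on the headline numbers (an assumption on my part: that the 15x ratio applies directly to the 9.18 tokens/sec prompt-eval figure), the implied llama.cpp baseline works out to well under one token per second:

```python
# Back-of-the-envelope check of the reported figures (derived, not measured):
# if KTransformers hits 9.18 tok/s on prompt eval and that is ~15x llama.cpp,
# the implied llama.cpp baseline is:
ktransformers_prompt_tps = 9.18  # tokens/sec, from the article
speedup = 15                     # reported KTransformers-vs-llama.cpp ratio
implied_llamacpp_tps = ktransformers_prompt_tps / speedup
print(f"Implied llama.cpp prompt eval: {implied_llamacpp_tps:.2f} tok/s")
```

That sub-1 tok/s baseline is plausible for a 671B MoE model whose experts spill into system RAM, which is exactly the bottleneck KTransformers' CPU/GPU offload strategy targets.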
Table of contents
- How KTransformers Dominated llama.cpp in Real-World Inference
- Why This Experiment?
- Key Highlights from the Stream
- Biggest Takeaway: KTransformers Crushed llama.cpp in Prompt Eval Speeds
- Watch the Full Stream Recording