Running a 35B AI Model on 6GB VRAM, FAST (llama.cpp Guide)


A practical guide to running the Qwen 3.6 35B mixture-of-experts (MoE) model on a GTX 1060 with 6GB of VRAM using llama.cpp. It covers five settings: --n-cpu-moe, which offloads expert blocks to CPU RAM and raises speed from 3 to 10 tokens/sec; --no-mmap, which preloads the full model into RAM (13.5 tokens/sec); tuning the GPU layer count to use the remaining free VRAM (17 tokens/sec); TurboQuant KV cache compression, which expands the context from 64K to 256K tokens without speed loss; and --mlock, which prevents the kernel from paging out experts during long-running sessions. Speculative decoding was also tested, but it failed because of MoE expert thrashing and constraints of the model's SSM layers.
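As a rough illustration of how the flags in this summary might be combined, here is a minimal Python sketch that assembles a llama.cpp server command line. The summary names --n-cpu-moe and --no-mmap, and mentions mlock and the GPU layer count. The binary name llama-server and the -m, --n-gpu-layers and --ctx-size options are assumed from stock llama.cpp rather than taken from the summary. The model filename and every number are placeholders, not the values used in the video. TurboQuant is left out because the summary does not say which option enables it.

```python
import shlex


def build_llama_server_args(model_path, n_cpu_moe, n_gpu_layers, ctx_size,
                            no_mmap=True, mlock=True):
    """Assemble an argv list for llama-server (hypothetical helper)."""
    args = [
        "llama-server",
        "-m", model_path,                     # path to the GGUF model file
        "--n-gpu-layers", str(n_gpu_layers),  # layers placed on the GPU
        "--n-cpu-moe", str(n_cpu_moe),        # layers whose MoE experts stay in CPU RAM
        "--ctx-size", str(ctx_size),          # context window in tokens
    ]
    if no_mmap:
        args.append("--no-mmap")  # read the whole model into RAM instead of memory-mapping it
    if mlock:
        args.append("--mlock")    # ask the kernel to keep the model resident in RAM
    return args


# Placeholder values only: tune n_cpu_moe and n_gpu_layers for your own hardware.
cmd = build_llama_server_args(
    "model.gguf", n_cpu_moe=30, n_gpu_layers=99, ctx_size=65536
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell. Keeping the options in one helper makes it easy to change a single value, such as the number of expert layers kept on the CPU, and compare tokens/sec between runs.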

15m watch time