Run Qwen 3.6, a 35B-parameter AI model, on just 6GB of VRAM using llama.cpp.

This setup shouldn’t work, but with the right optimizations it reaches about 17 tok/s on a GTX 1060.

In this video, I break down how to run large language models locally on low VRAM GPUs using MoE offloading, memory tuning, and a few critical flags that dramatically improve performance.

What you’ll learn:
• How to run 35B LLMs on 6GB VRAM
• llama.cpp optimization techniques
• MoE (Mixture of Experts) offloading explained
• Fixing slow token generation (3 tok/s → 17 tok/s)
• Using --no-mmap and --mlock for performance and stability
• TurboQuant for increasing context length
• What doesn’t work (and why)
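
A rough way to see why KV cache compression buys context length: cache size grows linearly with both the number of tokens and the bits stored per value, so cutting the bits by 4 leaves room for 4 times the tokens in the same memory. The sketch below uses the standard estimate for an attention KV cache. The layer count, head count, head size, and the 16-bit and 4-bit widths are all hypothetical placeholders, not the real Qwen 3.6 or TurboQuant numbers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_tokens, bits_per_value):
    """Estimate KV cache size: K and V each hold n_kv_heads * head_dim
    values per token, per attention layer."""
    values = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens
    return values * bits_per_value // 8


# Hypothetical model shape, chosen only to make the arithmetic concrete.
LAYERS, KV_HEADS, HEAD_DIM = 10, 2, 256

# Assumed baseline: 64K tokens at 16 bits per cached value.
baseline = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, 64 * 1024, 16)

# Assumed compressed cache: 256K tokens at 4 bits per cached value.
compressed = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, 256 * 1024, 4)

print(f"baseline:   {baseline / 2**20:.0f} MiB")
print(f"compressed: {compressed / 2**20:.0f} MiB")
```

Real quantized caches also store a small amount of scaling metadata per block, so the actual ratio is usually a little under the ideal figure.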

Hardware used:
• NVIDIA GTX 1060 (6GB VRAM)
• Intel i3-8100
• 24GB RAM

Tech stack:
Proxmox → LXC → Docker → llama.cpp (adapt based on your setup)

Useful resources:
• Qwen 3.6 35B-A3B model: https://huggingface.co/Qwen/Qwen3.6-35B-A3B
• TurboQuant paper: https://arxiv.org/abs/...
• llama.cpp TurboQuant fork: https://github.com/TheTom/llama-cpp-turboquant

If you're interested in running AI locally, optimizing LLM performance, or pushing old hardware to its limits, subscribe for more experiments.

Chapters:
00:00 This shouldn’t work
00:27 Setup
01:46 Why it’s slow by default
02:52 MoE breakthrough
04:33 Fixing memory bottlenecks
05:32 Hitting 17 tok/s
06:40 4× context trick
09:23 Stability fix
11:04 What failed
13:32 The 5 flags

#LocalAI #LLM #llamacpp #Qwen #AIonGPU #LowVRAM


A practical guide to running the Qwen 3.6 35B mixture-of-experts model on a 6GB VRAM GTX 1060 using llama.cpp. Five key flags are covered: --n-cpu-moe to offload expert blocks to CPU RAM (raising speed from 3 to 10 tok/s), --no-mmap to preload the full model into RAM (13.5 tok/s), tuning the GPU layer count to use the remaining free VRAM (17 tok/s), TurboQuant KV cache compression to expand context from 64K to 256K tokens without speed loss, and --mlock to prevent the kernel from paging out experts during long-running sessions. Speculative decoding was also tested but failed because of MoE expert thrashing and constraints imposed by the model's SSM layers.
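
As a minimal sketch of how the stock llama.cpp flags named above fit together, the Python snippet below assembles a llama-server command line. The model path and every numeric value are placeholders to tune for your own hardware, not the settings used in the video. The TurboQuant option is left out on purpose, since it belongs to the fork and its flag names are not given here.

```python
import shlex


def build_llama_server_cmd(model_path, n_cpu_moe, n_gpu_layers, ctx_size,
                           lock_memory=True):
    """Assemble a llama-server argument list from the flags discussed above."""
    cmd = [
        "llama-server",
        "-m", model_path,
        "--n-gpu-layers", str(n_gpu_layers),  # layers offloaded to the GPU
        "--n-cpu-moe", str(n_cpu_moe),        # MoE expert weights of the first N layers stay on the CPU
        "--ctx-size", str(ctx_size),          # context window in tokens
        "--no-mmap",                          # load the model into RAM instead of memory-mapping it
    ]
    if lock_memory:
        cmd.append("--mlock")                 # ask the OS to keep the model resident in RAM
    return cmd


cmd = build_llama_server_cmd(
    model_path="/models/your-model.gguf",  # hypothetical path
    n_cpu_moe=30,                          # placeholder, tune until VRAM is nearly full
    n_gpu_layers=99,                       # placeholder, a large value requests all layers
    ctx_size=65536,                        # 64K tokens
)
print(shlex.join(cmd))
```

Building the arguments as a list keeps each flag next to its value and makes it easy to pass the result to subprocess.run or to print it for a Docker entrypoint. Lower the --n-cpu-moe value step by step while watching VRAM usage to find the fastest setting that still fits.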

Running a 35B AI Model on 6GB VRAM, FAST (llama.cpp Guide)

<p>Take this. Drop llama-swap or llama-swapo in front of llama-server and you have a drop-in replacement for tools that expect Ollama.</p>