Running local LLMs on 8GB GPUs is practical with the right model selection and quantization. A 7B–8B parameter model at Q4_K_M quantization fits within 5GB of VRAM, leaving room for the KV cache. The guide covers GGUF quantization formats (Q4_K_M, Q5_K_M, Q8_0), model recommendations per VRAM tier (8GB, 6GB, 4GB), and step-by-step setup for both Ollama and llama.cpp. Key optimizations include reducing the context window from 4096 to 2048 tokens (freeing 1–2GB of VRAM), selecting an explicit quantization tag when pulling models, offloading layers to the GPU via the num_gpu option (Ollama) or the -ngl flag (llama.cpp), and partially offloading 14B models to the CPU. Common pitfalls include pulling default FP16 model tags, setting the maximum context length, thermal throttling on older cards, and unrealistic quality expectations from small quantized models.
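
The VRAM figures in the summary can be sanity-checked with some rough arithmetic. The sketch below is not taken from the guide. It assumes a Llama-2-7B-style architecture (32 layers, 32 KV heads, head dimension 128, FP16 KV cache) and an average of roughly 4.85 bits per weight for Q4_K_M, which is an approximation. The helper names `weight_bytes` and `kv_cache_bytes` are made up for illustration.

```python
# Rough VRAM estimate for a quantized 7B model.
# All architecture numbers below are illustrative assumptions
# (Llama-2-7B-like: 32 layers, 32 KV heads, head_dim 128, FP16 cache).

GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_ctx: int, n_layers: int = 32, n_kv_heads: int = 32,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size in bytes for a context of n_ctx tokens."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx


# ~4.85 bits/weight is an assumed average for Q4_K_M, not an exact figure.
weights = weight_bytes(7e9, 4.85)
saved = kv_cache_bytes(4096) - kv_cache_bytes(2048)

print(f"weights: about {weights / GIB:.2f} GiB")
print(f"KV cache saved by going 4096 -> 2048 tokens: {saved / GIB:.2f} GiB")
```

Under these assumptions the weights come to about 4 GiB and halving the context frees about 1 GiB, which is in line with the ranges quoted above. Models that use grouped-query attention have fewer KV heads, so their cache and the resulting savings are smaller.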

19m read time · From sitepoint.com
Table of contents
You Don't Need a $1,500 GPU to Run LLMs Locally
How to Run Local LLMs on an 8GB GPU
Understanding VRAM, Quantization, and Why They Matter
Best Models for 8GB, 6GB, and 4GB VRAM
Setting Up Ollama for Low-VRAM GPUs
Advanced Optimization with llama.cpp
Performance Tuning Tips and Common Pitfalls
What's Actually Possible on Budget Hardware
