Running 70B-parameter language models on consumer GPUs requires quantization to reduce memory footprint. This deep dive covers the major quantization formats—GGUF, EXL2, GPTQ, and AWQ—explaining their tradeoffs in speed, quality, and VRAM requirements. A VRAM estimation formula and a Python/JavaScript calculator are provided.
15 min read · From sitepoint.com
Table of Contents
- The 70B Problem
- What Is Quantization and Why Does It Matter?
- Quantization Formats Compared: GGML, GGUF, EXL2, and AWQ
- Calculating VRAM Requirements
- Running a 70B GGUF Model with llama.cpp
- Running a 70B EXL2 Model with ExLlamaV2
- Measuring Quality Impact: How Much Do You Lose?
- Decision Framework: Choosing the Right Setup
- Key Takeaways
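As a preview of the estimation approach covered under "Calculating VRAM Requirements", here is a minimal Python sketch of the common rule of thumb: weight memory ≈ parameter count × bits per weight / 8, plus allowances for KV cache and runtime overhead. The function name and the KV-cache and overhead constants are illustrative assumptions, not the article's exact calculator.

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     context_len: int = 4096, overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate: quantized weights + KV cache + runtime overhead.

    Assumptions (illustrative, not the article's exact formula):
      - weights: params * bits_per_weight / 8 bytes
      - KV cache: ~0.5 MB per token of context for a 70B-class model
    """
    weights_gb = params_billions * bits_per_weight / 8  # 1e9 params * bits/8 bytes = GB
    kv_cache_gb = context_len * 0.0005                  # assumed ~0.5 MB per token
    return weights_gb + kv_cache_gb + overhead_gb

# Example: a 70B model at 4.0 bits per weight with a 4k context
print(f"{estimate_vram_gb(70, 4.0):.1f} GB")  # ~38.5 GB -> two 24 GB GPUs, not one
```

The example illustrates why 4-bit quantization alone is not enough to fit a 70B model on a single 24 GB card: the weights alone take roughly 35 GB before any context is allocated.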