Sam McLeod

An interactive visual guide to GGUF quantisation for large language models. It covers the trade-offs between model size, quality (measured as perplexity), and performance across quantisation types (Q2 through Q8, IQ variants, K-quants). It includes a heatmap comparing quantisation types on CUDA and Metal hardware, a sweet-spot table mapping model sizes (3B–110B parameters) to VRAM budgets (8GB–64GB), efficiency charts showing perplexity per GB saved across model sizes, and a decision tree for choosing a quantisation level based on quality versus size priorities and available hardware.

Understanding AI/LLM Quantisation Through Interactive Visualisations

GGUF Quantisation Sweet Spots (8K Context)

Right-sizing model quantisation for your (v)RAM
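
As a rough illustration of the right-sizing idea, the sketch below estimates the weights-only size of a model from its parameter count and checks it against a (v)RAM budget. It is a minimal sketch rather than the method behind the charts: the function names and the 2 GB headroom default are hypothetical, the bits-per-weight figures are derived from the llama.cpp block layouts for single tensor types, and mixed file types such as Q4_K_M will come out somewhat larger because they store some tensors at higher precision. KV cache and runtime buffers are not modelled beyond the flat headroom.

```python
# Bits per weight implied by llama.cpp block layouts for a single tensor type.
# Real GGUF files mix tensor types, so treat these as lower-bound estimates.
BITS_PER_WEIGHT = {
    "Q4_0": 4.5,     # 18 bytes per block of 32 weights
    "Q4_K": 4.5,     # 144 bytes per super-block of 256 weights
    "Q5_K": 5.5,     # 176 bytes per super-block of 256 weights
    "Q6_K": 6.5625,  # 210 bytes per super-block of 256 weights
    "Q8_0": 8.5,     # 34 bytes per block of 32 weights
}


def estimate_weights_gb(params_billion: float, quant: str) -> float:
    """Approximate weights-only size in GB (10^9 bytes)."""
    total_bits = params_billion * 1e9 * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 1e9


def fits(params_billion: float, quant: str, vram_gb: float,
         headroom_gb: float = 2.0) -> bool:
    """True if the weights plus an assumed flat headroom fit in vram_gb.

    headroom_gb is a hypothetical allowance for KV cache and buffers;
    the real figure depends on context length and model architecture.
    """
    return estimate_weights_gb(params_billion, quant) + headroom_gb <= vram_gb


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        size = estimate_weights_gb(7, quant)
        print(f"7B at {quant}: ~{size:.2f} GB, fits in 8 GB: {fits(7, quant, 8)}")
```

Under these assumptions a 7B model at Q8_0 needs about 7.4 GB for weights alone and does not fit an 8 GB budget once headroom is added, while the same model at Q4_0 needs about 3.9 GB and does.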

GGUF Quantisation Efficiency vs Quality Across Model Sizes