An interactive visual guide to understanding GGUF quantization for large language models. Covers the trade-offs between model size, quality (perplexity), and performance across different quantization types (Q2 through Q8, IQ variants, K-quants). Includes a heatmap comparing quantization types on CUDA vs Metal hardware, a sweet-spot table mapping model sizes (3B–110B) to VRAM constraints (8GB–64GB), efficiency charts showing perplexity-per-GB-saved across model sizes, and a decision tree for choosing the right quantization level based on quality vs. size priorities and available hardware.

7m read time. From smcleod.net
Table of contents
Disclaimer
Quantisation
Dashboard
Insights on Quantisation and Model Size
Perplexity Increase vs Compression
GGUF Quantisation Spectrum Heatmap
Understanding GGUF Quantisation
GGUF Quantisation Sweet Spots (8K Context)
Right-sizing model quantisation for your (v)RAM
GGUF Quantisation Efficiency vs Quality Across Model Sizes
GGUF Quantisation Decision Tree
Visualisations Explained
References
