Cloudflare developed Unweight, a lossless inference-time compression system for LLM weights that achieves a 15–22% model size reduction without sacrificing output quality. The system exploits the statistical redundancy in BF16 exponent bytes (just 16 values cover 99% of weights) using Huffman coding applied selectively to MLP weight matrices. Decompression happens in fast on-chip shared memory and feeds tensor cores directly, avoiding extra HBM round-trips. Four execution pipelines adapt to different batch sizes and weight shapes, with an autotuner selecting the optimal strategy per matrix. On Llama 3.1 8B, Unweight saves ~3 GB of VRAM and reduces model size by ~22% for distribution bundles, at a current throughput cost of 30–40% that is expected to narrow with further optimization. The GPU kernels and a technical paper have been open-sourced.
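To make the exponent-redundancy idea concrete, here is a minimal CPU-side sketch in Python. It is not Unweight's implementation: it assumes synthetic Gaussian weights (real model weights only roughly follow this shape), converts float32 to BF16 by simple truncation, and uses hypothetical helper names (`bf16_exponent`, `huffman_code_lengths`, `exponent_stats`). It measures how much of the data the 16 most common exponent values cover, and estimates the size ratio if only the 8-bit exponent field is Huffman-coded while the sign and mantissa bits are stored raw.

```python
import heapq
import random
import struct
from collections import Counter


def bf16_exponent(x: float) -> int:
    """Return the 8-bit exponent field of x after truncating float32 to BF16."""
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    bits16 = bits32 >> 16          # BF16 keeps the top 16 bits of a float32
    return (bits16 >> 7) & 0xFF    # layout: 1 sign, 8 exponent, 7 mantissa bits


def huffman_code_lengths(counts: Counter) -> dict:
    """Return the Huffman code length in bits for each symbol in counts."""
    if len(counts) == 1:
        return {sym: 1 for sym in counts}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {sym: 0}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def exponent_stats(weights):
    """Return (top-16 coverage, average coded exponent bits, size ratio)."""
    counts = Counter(bf16_exponent(w) for w in weights)
    total = sum(counts.values())
    top16 = sum(n for _, n in counts.most_common(16)) / total
    lengths = huffman_code_lengths(counts)
    avg_bits = sum(counts[s] * lengths[s] for s in counts) / total
    # Assumption: sign + mantissa (8 bits) stay raw; only the exponent is coded.
    ratio = (8 + avg_bits) / 16
    return top16, avg_bits, ratio


random.seed(0)
# Assumption: zero-mean Gaussian weights with a small standard deviation.
weights = [random.gauss(0.0, 0.02) for _ in range(100_000)]
top16, avg_bits, ratio = exponent_stats(weights)
print(f"top-16 exponent coverage: {top16:.4f}")
print(f"average Huffman bits per exponent: {avg_bits:.2f}")
print(f"estimated size ratio for coded tensors: {ratio:.3f}")
```

The ratio printed here applies only to the synthetic tensor being coded and ignores code-table and block-indexing overhead, so it should not be compared directly with the 15–22% whole-model figure, which reflects compressing only the MLP weight matrices.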
Table of contents
Why compression is harder than it sounds
How model weights can be compressed effectively
The exponent is surprisingly predictable
The GPU memory bottleneck
Four ways to use compressed weights
How the reconstructive matmul works
Sharing the GPU between decoding and computation
Pipelining across layers
Autotuning
One compression format, multiple uses
Our results
Why this matters
What’s next