Unweight: how we compressed an LLM 22% without sacrificing quality

Cloudflare developed Unweight, a lossless inference-time compression system for LLM weights that reduces model size by 15–22% without sacrificing output quality. The system exploits the statistical redundancy in BF16 exponent bytes, where just 16 distinct values cover 99% of weights, by applying Huffman coding selectively to MLP weight matrices. Weights are decompressed in fast on-chip shared memory and fed directly to the tensor cores, which avoids extra round-trips to HBM. Four execution pipelines cover different batch sizes and weight shapes, and an autotuner selects the best strategy for each matrix. On Llama 3.1 8B, Unweight saves ~3 GB of VRAM and shrinks distribution bundles by ~22%, at a current throughput cost of 30–40% that is expected to narrow with further optimization. The GPU kernels and a technical paper have been open-sourced.
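To make the exponent-redundancy idea concrete, here is a minimal CPU-side sketch in Python. It is not Unweight's implementation. It assumes synthetic Gaussian weights rather than a real model, truncates float32 to BF16 instead of rounding, and uses hypothetical helper names (`bf16_exponent`, `huffman_code_lengths`, `exponent_stats`). It measures how much of the data the 16 most common exponent values cover and how many bits per weight a Huffman code over the exponent field would need if the sign and mantissa bits were stored uncompressed.

```python
import heapq
import random
import struct
from collections import Counter


def bf16_exponent(x: float) -> int:
    """Return the 8-bit exponent field of x after truncation to BF16."""
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    bf16 = bits32 >> 16        # BF16 keeps the top 16 bits of float32 (truncation here)
    return (bf16 >> 7) & 0xFF  # layout: 1 sign bit | 8 exponent bits | 7 mantissa bits


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap entries are (weight, unique tiebreak, symbols in this subtree).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:      # every merge adds one bit to each symbol below it
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths


def exponent_stats(weights):
    """Return (top-16 exponent coverage, avg Huffman bits per exponent, size ratio)."""
    exps = Counter(bf16_exponent(w) for w in weights)
    total = sum(exps.values())
    top16 = sum(count for _, count in exps.most_common(16)) / total
    lengths = huffman_code_lengths(exps)
    avg_bits = sum(exps[s] * lengths[s] for s in exps) / total
    # Assumption: sign + mantissa (8 bits) stay raw, only the exponent is coded.
    ratio = (8 + avg_bits) / 16
    return top16, avg_bits, ratio


random.seed(0)
# Synthetic stand-in for a weight matrix; real model weights will differ.
weights = [random.gauss(0.0, 0.02) for _ in range(50_000)]
top16, avg_bits, ratio = exponent_stats(weights)
print(f"top-16 exponent coverage: {top16:.4f}")
print(f"avg Huffman bits per exponent: {avg_bits:.2f}")
print(f"compressed / original size: {ratio:.3f}")
```

The numbers from this sketch describe a synthetic matrix only and should not be read as a reproduction of the reported 15–22%. That figure covers a whole model in which only the MLP matrices are compressed, and it includes format overheads that this sketch ignores.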

#cloudflare

#ai-inference

Apr 17 • 16m read time • From blog.cloudflare.com

Table of contents

Why compression is harder than it sounds
How model weights can be compressed effectively
The exponent is surprisingly predictable
The GPU memory bottleneck
Four ways to use compressed weights
How the reconstructive matmul works
Sharing the GPU between decoding and computation
Pipelining across layers
Autotuning
One compression format, multiple uses
Our results
Why this matters
What’s next
