KL divergence (KLD) is used to measure how much a quantised LLM deviates from a bf16 reference, providing a better quality signal than perplexity by isolating quantisation damage. Using the mlx-kld tool, six MLX quantisations of Qwen 3.6 27B (dense) and five of Qwen 3.6 35B-A3B (MoE) are benchmarked across 65,536 tokens from WikiText-2. Key findings: 8-bit is effectively a bf16 drop-in; 6-bit RTN and oQ6 are statistically tied for the dense model; Unsloth Dynamic 4-bit is a poor deal at 8.6 effective bpw with worse quality than 6-bit; for the MoE model, DWQ-4bit beats oQ4 by ~34% because it protects router tensors; quant rankings don't generalise between dense and MoE architectures. The mlx-kld tool caches reference log-probs as a sparse top-K structure (~420x compression) and loads only one model at a time, making multi-quant comparisons practical on Apple Silicon.
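
To make the measurement concrete, here is a minimal pure-Python sketch of per-token KL divergence against a reference distribution, plus a top-K sparse cache of the reference log-probs. It is not taken from mlx-kld: the function names are hypothetical, and lumping everything outside the top K into a single tail bucket is an assumption made for illustration. The tool's actual tail handling may differ.

```python
import math

def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(ref_logp, quant_logp):
    """Exact KL(ref || quant) in nats for one token position."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(ref_logp, quant_logp))

def top_k_sparse(ref_logp, k):
    """Keep the k most likely reference tokens as (index, logp) pairs.

    Also returns the probability mass left in the discarded tail.
    """
    order = sorted(range(len(ref_logp)), key=lambda i: ref_logp[i], reverse=True)
    kept = [(i, ref_logp[i]) for i in order[:k]]
    tail_mass = max(0.0, 1.0 - sum(math.exp(lp) for _, lp in kept))
    return kept, tail_mass

def kl_from_sparse(kept, tail_mass, quant_logp):
    """Approximate KL from the sparse cache.

    Assumption: all tokens outside the top k are treated as one bucket,
    which gives a lower bound on the exact KL.
    """
    kl = sum(math.exp(lp) * (lp - quant_logp[i]) for i, lp in kept)
    kept_idx = {i for i, _ in kept}
    q_tail = sum(math.exp(lq) for i, lq in enumerate(quant_logp) if i not in kept_idx)
    if tail_mass > 0.0 and q_tail > 0.0:
        kl += tail_mass * math.log(tail_mass / q_tail)
    return kl
```

In a real benchmark the per-token values would be averaged over every evaluated position (65,536 tokens here), and the reference cache would be written once and reused for each quantised model, so only one model needs to be in memory at a time.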

13m read time · From smcleod.net
Table of contents
What KLD measures, Evaluation modes, Qwen 3.6 27B results, Qwen 3.6 35B-A3B results, What I’d actually run, Measuring KLD with mlx-kld, Sanity checks and approximation error, Caveats
