A deep dive into batch decoding of unary codes using a Tunstall-style table-driven approach. The post starts with a standard one-at-a-time bit-stream decoder, then introduces a 256-entry lookup table that processes 8 bits at a time, emitting multiple decoded values per iteration. By packing up to 8 output values into a single uint64 table entry and using population count to track output count, the loop's critical path reduces to a single clock cycle. Benchmarks on Zen 4 show ~9x speedup over the naive decoder. The broader argument is that variable-length encodings that mix everything into one serial stream leave massive parallelism gains on the table, and splitting codes like Rice or Exp-Golomb into their constituent parts (unary prefix + fixed suffix) enables SIMD-friendly batch processing — a technique already used in Oodle's Kraken codec for a decade.
Table of contents
Tunstall-style decodingThe optimized versionSo what?Appendix: table generationRelatedSort: