Large Language Models (LLMs) often have too many parameters to run efficiently on consumer hardware. Quantization reduces model size by lowering the precision of the parameters from higher bit-widths (such as 32-bit floating point) to lower bit-widths (such as 8-bit integers), which cuts memory usage while trying to preserve model accuracy. The article covers symmetric and asymmetric quantization, as well as post-training quantization (PTQ) and quantization-aware training (QAT). More advanced methods push further: GPTQ targets 4-bit weights, while BitNet and its 1.58-bit variant reduce weights to 1 or 1.58 bits without significantly compromising performance.
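
To make the symmetric versus asymmetric distinction concrete, here is a minimal sketch in plain Python. It assumes the common absmax formulation for the symmetric case and a scale plus zero-point formulation for the asymmetric case; the function names and the sample weights are hypothetical and chosen only for illustration, and the article's own formulas may differ in detail.

```python
def quantize_symmetric(values, bits=8):
    """Absmax quantization: map [-max|x|, +max|x|] onto a signed range centred on 0."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    absmax = max(abs(v) for v in values)
    scale = qmax / absmax if absmax else 1.0        # avoid dividing by zero
    return [round(v * scale) for v in values], scale


def dequantize_symmetric(quantized, scale):
    """Recover approximate floats from symmetric integer codes."""
    return [q / scale for q in quantized]


def quantize_asymmetric(values, bits=8):
    """Zero-point quantization: map [min(x), max(x)] onto the full signed range."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1   # -128 and 127 for 8 bits
    lo, hi = min(values), max(values)
    scale = (qmax - qmin) / (hi - lo) if hi != lo else 1.0
    zero_point = round(qmin - lo * scale)           # integer code that lo maps to qmin
    quantized = [
        max(qmin, min(qmax, round(v * scale) + zero_point))  # clip to the valid range
        for v in values
    ]
    return quantized, scale, zero_point


def dequantize_asymmetric(quantized, scale, zero_point):
    """Recover approximate floats from asymmetric integer codes."""
    return [(q - zero_point) / scale for q in quantized]


# Hypothetical weights, skewed towards positive values.
weights = [-0.8, 0.0, 0.5, 2.0]

sym_q, sym_scale = quantize_symmetric(weights)
asym_q, asym_scale, asym_zp = quantize_asymmetric(weights)

print("symmetric :", sym_q, dequantize_symmetric(sym_q, sym_scale))
print("asymmetric:", asym_q, dequantize_asymmetric(asym_q, asym_scale, asym_zp))
```

With skewed inputs like these, the symmetric mapping leaves part of the negative integer range unused, while the asymmetric mapping spreads the values over all 256 codes at the cost of storing an extra zero-point.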

22m read time · From newsletter.maartengrootendorst.com
Table of contents
How to Represent Numerical Values
Memory Constraints
Common Data Types
Symmetric Quantization
Asymmetric Quantization
Range Mapping and Clipping
Calibration
Dynamic Quantization
Static Quantization
The Realm of 4-bit Quantization
The Era of 1-bit LLMs: BitNet
All Large Language Models are in 1.58 Bits
Conclusion
Resources
