Large Language Models (LLMs) are often too large to run efficiently on consumer hardware because of their enormous number of parameters. Quantization reduces model size by lowering the precision of the parameters from higher bit-widths (such as 32-bit floating point) to lower bit-widths (such as 8-bit integers), which cuts memory usage while trying to preserve model accuracy. Symmetric and asymmetric quantization are explored, as well as post-training quantization (PTQ) and quantization-aware training (QAT). More advanced methods push the limits further: GPTQ is typically used for weights of around 4 bits, while BitNet reduces weights to 1 bit, or to 1.58 bits in its ternary variant, without significantly compromising performance.
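As a rough illustration of the 32-bit float to 8-bit integer mapping described above, the following sketch applies symmetric (absmax) quantization to a handful of values and then maps them back. The function names and the sample weights are hypothetical and chosen only for this example; real libraries work on whole tensors and handle many more details.

```python
def quantize_absmax(values, bits=8):
    """Symmetric (absmax) quantization of a list of floats to signed integers."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits
    alpha = max(abs(v) for v in values)   # largest magnitude in the input
    scale = qmax / alpha if alpha > 0 else 1.0
    # Scale, round to the nearest integer, and clip to [-qmax, qmax].
    quantized = [max(-qmax, min(qmax, round(v * scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate floating-point values."""
    return [q / scale for q in quantized]


# Hypothetical sample weights, for illustration only.
weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)

print(q)         # integers in [-127, 127]
print(restored)  # close to the original weights, but not exact
```

Each stored value now needs 8 bits instead of 32, at the cost of a rounding error of at most half a quantization step per value.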
Table of contents
How to Represent Numerical Values
Memory Constraints
Common Data Types
Symmetric Quantization
Asymmetric Quantization
Range Mapping and Clipping
Calibration
Dynamic Quantization
Static Quantization
The Realm of 4-bit Quantization
The Era of 1-bit LLMs: BitNet
All Large Language Models are in 1.58 Bits
Conclusion
Resources