Microsoft's BitNet b1.58 paper proposes an LLM architecture in which every weight is constrained to the ternary values {-1, 0, +1}, so each weight carries only log2(3) ≈ 1.58 bits of information. Unlike post-training quantization, the model is trained from scratch, with a BitLinear layer applying absolute-mean (absmean) quantization to the weights. Because the weights are ternary, the multiplications inside each matrix product reduce to additions and subtractions, yielding significant memory and latency savings. Benchmarks against reproduced LLaMA baselines show BitNet b1.58 using 3.3x less memory and running 2.7–4.1x faster at the 3B–70B scales, while matching or slightly exceeding LLaMA's accuracy. The efficiency gains grow with model size, making the approach particularly promising for large-scale deployment.
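The quantization step the paper describes for BitLinear maps each weight matrix onto {-1, 0, +1} by scaling it with its mean absolute value, then rounding and clipping. Below is a minimal PyTorch sketch of that absmean step; the function name and the epsilon guard are illustrative, not taken from the paper.

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Map a weight tensor onto the ternary set {-1, 0, +1} via absmean scaling."""
    # Scale by the mean absolute value of the whole weight matrix (the "absmean" scale).
    scale = w.abs().mean()
    # Round to the nearest integer and clip into the ternary range.
    return (w / (scale + eps)).round().clamp(-1, 1)

# Example: quantize a small random weight matrix.
w = torch.randn(4, 8)
w_ternary = absmean_ternary_quantize(w)
print(w_ternary.unique())  # subset of tensor([-1., 0., 1.])
```

In a full BitLinear layer this weight mapping is paired with activation quantization inside the forward pass; the sketch above only shows the weight-side ternary step.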
