Kimi AI's Huge LLM Breakthrough Is Fascinating [Attention Residuals]
Kimi (Moonshot AI) has published research on 'attention residuals', an LLM architectural technique that applies attention across the depth dimension of a transformer (over layers) rather than only across tokens. Standard residual connections add every layer's output into a single running sum with equal weight, which causes a 'pre-norm dilution problem': information from earlier layers is progressively compressed and lost as the network gets deeper. Attention residuals address this by letting each layer attend directly to the outputs of all previous layers with learned weights, so it can selectively retrieve earlier representations. Because attending to every earlier layer scales quadratically with depth, a practical 'block attention residuals' variant groups layers into blocks to reduce that cost, yielding ~25% compute savings with only ~4% training overhead and an inference latency increase of under 2%. The approach was validated on a 48B-parameter model trained on 1.4T tokens and showed improvements across all benchmarks, especially on multi-step reasoning tasks. The paper also compares the method with mHC (Manifold-Constrained Hyper-Connections), noting that the two approaches are orthogonal but that combining them likely yields diminishing returns while sacrificing efficiency.
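
To make the mechanism concrete, here is a minimal sketch of attention over depth, written in plain Python on single vectors rather than per-token tensors. It is an illustration under stated assumptions, not the paper's implementation: the names `depth_attention`, `run_full`, and `run_blocks` are hypothetical, each layer is assumed to own one learned query vector, and earlier outputs are used directly as both keys and values (any normalization or projection the paper applies is omitted). The block variant assumes plain residual summation inside a block and attention only over block summaries.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def depth_attention(query, sources):
    """Mix earlier outputs with softmax weights scored by a per-layer query.

    Assumption: sources act as both keys and values.
    """
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, s) * scale for s in sources])
    dim = len(sources[0])
    mixed = [sum(w * s[i] for w, s in zip(weights, sources)) for i in range(dim)]
    return mixed, weights

def run_full(x, layers, queries):
    """Full variant: layer l attends over the embedding and all l-1 earlier outputs."""
    history = [x]
    for layer, q in zip(layers, queries):
        h, _ = depth_attention(q, history)
        history.append(layer(h))
    return history

def run_blocks(x, layers, queries, block_size):
    """Block variant: attend over completed block summaries plus the current partial sum."""
    summaries = [x]   # the embedding, then one summed output per finished block
    partial = None    # running residual sum inside the current block
    for i, (layer, q) in enumerate(zip(layers, queries)):
        sources = summaries + ([partial] if partial is not None else [])
        h, _ = depth_attention(q, sources)
        out = layer(h)
        partial = out if partial is None else [a + b for a, b in zip(partial, out)]
        if (i + 1) % block_size == 0:
            summaries.append(partial)
            partial = None
    return summaries, partial

if __name__ == "__main__":
    toy_layers = [lambda h: [v + 1.0 for v in h]] * 4   # stand-in for real layers
    toy_queries = [[0.0, 0.0]] * 4                      # zero queries give uniform weights
    print(run_full([0.0, 0.0], toy_layers, toy_queries))
    print(run_blocks([0.0, 0.0], toy_layers, toy_queries, block_size=2))
```

In the full variant the number of sources grows by one with every layer, which is where the quadratic cost in depth comes from. In the block sketch, each layer sees at most one entry per finished block plus the embedding and the current partial sum, so the source list grows with the number of blocks instead of the number of layers.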