Attention Residuals (AttnRes) is a drop-in replacement for standard residual connections in Transformer architectures, developed by the Kimi team at MoonshotAI. Instead of uniformly accumulating all layer outputs with fixed unit weights, AttnRes applies softmax attention over the outputs of all preceding layers, using a learned pseudo-query per layer, which enables selective, content-aware aggregation across depth. A practical Block AttnRes variant reduces the memory cost from O(Ld) to O(Nd), where L is the number of layers, N the number of blocks, and d the hidden size, by grouping layers into blocks and applying attention only at block boundaries. Evaluated on a 48B MoE model trained on 1.4T tokens, AttnRes consistently outperforms the baseline across benchmarks, with notable gains on GPQA-Diamond (+7.5) and HumanEval (+3.1). Scaling-law experiments show that Block AttnRes matches the loss of a baseline trained with 1.25x more compute.
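The core mechanism can be illustrated with a minimal sketch. Below, a per-layer aggregator keeps the outputs of all preceding layers and mixes them with softmax attention driven by a learned pseudo-query. The key projection, the per-token application of the weights, and the 1/sqrt(d) scaling are assumptions made for illustration; the description above only specifies softmax attention over preceding layer outputs with a learned pseudo-query per layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttnRes(nn.Module):
    """Sketch of an attention-based residual aggregator (AttnRes).

    Replaces the unit-weight residual sum over preceding layer outputs
    with a softmax attention whose query is a learned per-layer vector.
    Hypothetical details: the key projection and the scaling factor are
    assumptions, not confirmed by the source description.
    """

    def __init__(self, d_model: int):
        super().__init__()
        # Learned pseudo-query for this layer (one vector, not per token).
        self.pseudo_query = nn.Parameter(torch.randn(d_model) * d_model ** -0.5)
        # Assumed: keys are a linear projection of the stored layer outputs.
        self.key_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, history: list[torch.Tensor]) -> torch.Tensor:
        # history: outputs of all preceding layers, each [batch, seq, d].
        h = torch.stack(history, dim=2)                  # [batch, seq, L, d]
        keys = self.key_proj(h)                          # [batch, seq, L, d]
        # Score each stored layer output against the learned pseudo-query.
        scores = keys @ self.pseudo_query / h.size(-1) ** 0.5  # [batch, seq, L]
        weights = F.softmax(scores, dim=-1)              # attention over depth
        # Content-aware aggregation replaces the fixed unit-weight sum.
        return (weights.unsqueeze(-1) * h).sum(dim=2)    # [batch, seq, d]
```

In a full model, layer l would consume the aggregated state and append its own output to `history`. Under this reading, the Block variant would keep ordinary unit-weight residuals inside each block and run this aggregation only over the N block-boundary outputs, which is where the O(Ld) to O(Nd) memory reduction comes from.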