Attention Residuals replaces standard fixed residual accumulation with depth-wise softmax attention over all preceding layer outputs. This enables each layer to combine earlier representations using learned, input-dependent weights.
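
To make the contrast concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the per-layer `query` vector, the scaled dot-product scoring, and the function names are all hypothetical choices made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def standard_residual(prev_outputs):
    """Fixed accumulation: every earlier layer output gets weight 1."""
    dim = len(prev_outputs[0])
    return [sum(h[d] for h in prev_outputs) for d in range(dim)]

def attention_residual(prev_outputs, query):
    """Depth-wise softmax attention over earlier layer outputs.

    `query` stands in for a learned per-layer vector (an assumption).
    Scores depend on the layer outputs themselves, so the mixing
    weights change with the input.
    """
    dim = len(query)
    scores = [
        sum(q * x for q, x in zip(query, h)) / math.sqrt(dim)
        for h in prev_outputs
    ]
    weights = softmax(scores)
    mixed = [
        sum(w * h[d] for w, h in zip(weights, prev_outputs))
        for d in range(dim)
    ]
    return mixed, weights

# Three earlier layer outputs for one token, hidden size 2 (toy values).
prev = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fixed = standard_residual(prev)
mixed, weights = attention_residual(prev, query=[2.0, 0.0])
```

With this toy query, the first and third outputs receive equal weight and the second receives less, whereas the fixed sum treats all three identically.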

00:00 Intro to residual connections
03:27 Intuition behind attention residuals
04:43 Full attention residuals
09:43 Block attention residuals
13:07 Parallelism
14:21 Infrastructure design for efficient training
20:03 Infrastructure design for efficient inference
21:02 Related work
22:01 Discussions

References:
- [Attention Residuals] https://arxiv.org/abs/2603.15031

Jia-Bin Huang

Standard residual connections in transformers suffer from 'pre-norm dilution': as depth increases, each layer's contribution shrinks relative to the growing residual stream, causing imbalanced gradients. Attention residuals fix this by replacing fixed skip-connection weights with learned, data-dependent attention weights computed over all previous layer outputs. A block-level variant (block attention residuals) groups layers into blocks and attends over compact per-block summaries, which reduces memory and communication overhead during distributed training. With cross-stage caching during training and a two-phase computation strategy at inference, block attention residuals achieve a 1.25x compute advantage over standard residuals, consistently outperform baselines across model sizes, and show particular gains on multi-step reasoning tasks. Notably, attention residuals favor deeper, narrower architectures compared to standard residuals.
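
The block-level idea can be sketched in a few lines of plain Python. This is a hedged illustration only: summing the layer outputs inside each block, the `query` vector, and the names `block_summaries` and `attend` are assumptions made for the example and are not taken from the paper.

```python
import math

def block_summaries(layer_outputs, block_size):
    """Collapse consecutive layers into one summary vector per block.

    Summing the outputs inside a block is an assumption for this sketch.
    """
    dim = len(layer_outputs[0])
    summaries = []
    for start in range(0, len(layer_outputs), block_size):
        block = layer_outputs[start:start + block_size]
        summaries.append([sum(h[d] for h in block) for d in range(dim)])
    return summaries

def attend(sources, query):
    """Softmax-weighted mix of `sources`, scored against a hypothetical query."""
    dim = len(query)
    scores = [
        sum(q * x for q, x in zip(query, s)) / math.sqrt(dim)
        for s in sources
    ]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    mixed = [
        sum(w * s[d] for w, s in zip(weights, sources))
        for d in range(dim)
    ]
    return mixed, weights

# Six toy layer outputs, grouped into blocks of two layers each.
layers = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0],
          [0.0, 2.0], [3.0, 3.0], [1.0, 1.0]]
summaries = block_summaries(layers, block_size=2)
mixed, weights = attend(summaries, query=[0.0, 0.0])
```

Here a later layer attends over 3 block summaries instead of 6 individual layer outputs, so the number of vectors that must be stored and communicated shrinks by roughly the block size.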

The Residual Connection Is Broken. Here's the Fix.