The Residual Connection Is Broken. Here's the Fix.
This title could be clearer and more informative.Try out Clickbait Shieldfor free (5 uses left this month).
Standard residual connections in transformers suffer from 'pre-norm dilution' — as depth increases, each layer's contribution shrinks relative to the growing residual stream, causing imbalanced gradients. Attention residuals fix this by replacing fixed skip-connection weights with learned, data-dependent attention weights computed over all previous layer outputs. A block-level variant (block attention residuals) groups layers into compact summaries to reduce memory and communication overhead during distributed training. With cross-stage caching and a two-phase inference computation strategy, block attention residuals achieve a 1.25x compute advantage over standard residuals, consistently outperform baselines across model sizes, and show particular gains on multi-step reasoning tasks. Notably, attention residuals favor deeper, narrower architectures compared to standard residuals.
Sort: