Residual connections are used in virtually every deep learning model.
But can we improve them further? Hyper-connections are an exciting recent attempt to generalize residual connections.

In this video, we will cover the following:

00:00 Training deep models
00:46 Residual connections
02:45 Hyper-connections - Intuition
04:48 Hyper-connections - Math
05:54 Dynamic hyper-connections
06:47 Training instability
07:52 Example of instability 
08:44 mHC: Stabilizing training
11:02 mHC: Improving parametrization
12:30 mHC: Efficient infrastructure designs

References:
- [ResNet] https://arxiv.org/abs/1512.03385
- [Identity mapping] https://arxiv.org/abs/1603.05027
- [Hyper-Connections] https://arxiv.org/abs/2409.19606
- [mHC] https://arxiv.org/abs/2512.24880

Other explorations for improving residual connections (not discussed in this video):
- [DenseNet] https://arxiv.org/abs/1608.06993
- [FractalNet] https://arxiv.org/abs/1605.07648
- [Residual Matrix Transformers] https://arxiv.org/abs/2506.22696
- [MUDDFormer] https://arxiv.org/abs/2502.12170

Video made with Manim: https://www.manim.community/

Jia-Bin Huang

Residual connections have been a cornerstone of deep learning since ResNets, enabling gradient flow and stable training of deep models. Hyper-Connections (HC) extend this idea by expanding the input features into multiple parallel residual streams and using learnable aggregation, expansion, and feature-mixing matrices to route information dynamically across streams. This yields up to 1.8x faster convergence than standard residual connections. However, DeepSeek found that unconstrained feature-mixing matrices cause training instability: gains slightly above or below 1 compound exponentially across layers, so signals either blow up or vanish. Their fix, Manifold-Constrained Hyper-Connections (mHC), constrains the mixing matrix to be doubly stochastic (all entries non-negative, every row and every column summing to 1) via the Sinkhorn algorithm. DeepSeek also switched the activation function from tanh to sigmoid and added a scalar factor of 2 so that hyper-connections behave like standard residual connections at initialization. Three infrastructure optimizations (fused kernels, activation recomputation, and pipeline overlapping) keep the training overhead to just 6.7% at an expansion rate of 4.
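
To make the stability argument concrete, here is a minimal sketch in plain Python, not DeepSeek's implementation. It compares a scalar gain compounded over many layers with a product of Sinkhorn-normalized mixing matrices. The depth of 48 layers, the 20 Sinkhorn iterations, the uniform random logits, and the helper names `sinkhorn` and `matmul` are all assumptions chosen for illustration.

```python
import math
import random


def sinkhorn(logits, n_iters=20):
    """Alternately normalize rows and columns of exp(logits)."""
    n = len(logits)
    # Exponentiate so every entry is strictly positive.
    m = [[math.exp(v) for v in row] for row in logits]
    for _ in range(n_iters):
        # Rescale each row to sum to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Rescale each column to sum to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def matmul(a, b):
    """Multiply two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


random.seed(0)
n_streams = 4   # expansion rate quoted in the summary above
n_layers = 48   # hypothetical depth, for illustration only

# Unconstrained case: a per-layer gain slightly off 1 compounds exponentially.
amplified = 1.1 ** n_layers   # grows to roughly two orders of magnitude
attenuated = 0.9 ** n_layers  # shrinks toward zero

# Constrained case: multiply one doubly stochastic mixing matrix per layer.
product = [[1.0 if i == j else 0.0 for j in range(n_streams)]
           for i in range(n_streams)]
for _ in range(n_layers):
    logits = [[random.uniform(-0.5, 0.5) for _ in range(n_streams)]
              for _ in range(n_streams)]
    product = matmul(sinkhorn(logits), product)

# Row and column sums of the accumulated product stay at 1.
row_sums = [sum(row) for row in product]
col_sums = [sum(product[i][j] for i in range(n_streams))
            for j in range(n_streams)]

print("unconstrained gains:", amplified, attenuated)
print("row sums:", row_sums)
print("col sums:", col_sums)
```

The product of doubly stochastic matrices is itself doubly stochastic, which is why the accumulated mixing across layers neither amplifies nor attenuates the total signal carried by the streams, whatever the depth.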

How mHC Reinvents Residual Connections