DeepSeek's manifold-constrained Hyper-Connections (mHC) address a critical instability in transformer architectures. Standard residual connections carry a single information stream; Hyper-Connections (HC) expand this to multiple parallel streams combined through learnable mixing matrices. Left unconstrained, those mixing matrices can compound across layers and amplify signals exponentially, reaching 3000x at 27B parameters. mHC constrains each mixing matrix to be doubly stochastic (every row and every column sums to 1) using the Sinkhorn-Knopp algorithm, which prevents amplification while still allowing information to be routed between streams. In experiments at 10M parameters, unconstrained HC reaches a lower loss (0.88 vs. 1.12 for mHC) but shows unstable 6-7x amplification, while mHC holds its maximum amplification (Amax) at 1.00 with lower variance across seeds.
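
The constraint described above can be illustrated with a small, dependency-free sketch. It is not DeepSeek's implementation: the function names, the `exp` parameterization, the 3x3 example matrix, the depth of 24, and the use of the maximum absolute row sum as a stand-in for amplification are all assumptions made for illustration.

```python
import math


def sinkhorn_knopp(scores, n_iters=50):
    """Map a square matrix of raw scores to an (approximately) doubly stochastic matrix.

    Assumption: scores are exponentiated first so that all entries are positive.
    """
    n = len(scores)
    m = [[math.exp(x) for x in row] for row in scores]
    for _ in range(n_iters):
        # Normalize each row to sum to 1.
        normalized = []
        for row in m:
            s = sum(row)
            normalized.append([x / s for x in row])
        m = normalized
        # Normalize each column to sum to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def matmul(a, b):
    """Plain square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def stacked_gain(m, depth):
    """Max absolute row sum of m applied `depth` times (hypothetical gain proxy)."""
    total = m
    for _ in range(depth - 1):
        total = matmul(total, m)
    return max(sum(abs(x) for x in row) for row in total)


# Hypothetical unconstrained mixing matrix for 3 streams; every row sums to more than 1.
raw = [
    [1.0, 0.2, 0.1],
    [0.3, 0.9, 0.2],
    [0.1, 0.4, 1.0],
]
constrained = sinkhorn_knopp(raw)

print("unconstrained gain after 24 layers:", stacked_gain(raw, 24))
print("constrained gain after 24 layers:  ", stacked_gain(constrained, 24))
```

Because the product of doubly stochastic matrices is itself doubly stochastic, the constrained gain stays at 1 regardless of depth, whereas the unconstrained matrix compounds its row sums layer after layer.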

7m read time. From taylorkolasinski.com
Table of contents
The Setup
The Explosion
The Fix: Constrain the Manifold
The Results
Why This Matters
Takeaways
What's Next
Resources
