DeepSeek's manifold-constrained Hyper-Connections (mHC) addresses a critical instability in transformer architectures. While standard residual connections use a single information stream, Hyper-Connections expand to multiple parallel streams with learnable mixing matrices. However, unconstrained mixing matrices can amplify

7m read time From taylorkolasinski.com
Table of contents
The Setup
The Explosion
The Fix: Constrain the Manifold
The Results
Why This Matters
Takeaways
What’s Next
Resources
