DeepSeek's manifold-constrained Hyper-Connections (mHC) addresses a critical instability in transformer architectures. While standard residual connections use a single information stream, Hyper-Connections expand to multiple parallel streams with learnable mixing matrices. However, unconstrained mixing matrices can amplify
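To make the instability concrete, here is a minimal sketch of what composing many stream-mixing matrices does to signal gain. Everything in it is an illustrative assumption rather than DeepSeek's implementation: the helper names (`sinkhorn`, `composed_gain`), the 4 streams, the 64 layers, and the random matrices are hypothetical. It also assumes the constraint takes the form of doubly stochastic mixing matrices obtained by Sinkhorn-style normalization, which the summary above does not spell out.

```python
import math
import random


def matmul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inf_norm(m):
    """Largest absolute row sum: the worst-case gain on a bounded input."""
    return max(sum(abs(x) for x in row) for row in m)


def sinkhorn(m, iters=50):
    """Alternately normalize columns, then rows, of a positive matrix.

    The result has rows summing to 1 and columns summing to roughly 1,
    i.e. it is approximately doubly stochastic.
    """
    n = len(m)
    m = [row[:] for row in m]
    for _ in range(iters):
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
        row_sums = [sum(row) for row in m]
        m = [[m[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
    return m


def composed_gain(n_streams=4, depth=64, constrained=False, seed=0):
    """Gain of the product of `depth` random stream-mixing matrices."""
    rng = random.Random(seed)
    total = [[float(i == j) for j in range(n_streams)]
             for i in range(n_streams)]
    for _ in range(depth):
        raw = [[rng.gauss(0.0, 1.0) for _ in range(n_streams)]
               for _ in range(n_streams)]
        if constrained:
            # Assumed constraint: positive entries, then Sinkhorn normalization.
            layer = sinkhorn([[math.exp(x) for x in row] for row in raw])
        else:
            # Unconstrained: identity plus free-form (here random) mixing.
            layer = [[float(i == j) + raw[i][j] for j in range(n_streams)]
                     for i in range(n_streams)]
        total = matmul(layer, total)
    return inf_norm(total)


if __name__ == "__main__":
    print("unconstrained gain:", composed_gain(constrained=False))
    print("constrained gain:  ", composed_gain(constrained=True))
```

In this toy setup the unconstrained product's gain grows roughly exponentially with depth, while the constrained product stays at a gain of 1, because every layer has non-negative entries with rows summing to 1 and that property survives matrix multiplication.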

From taylorkolasinski.com (7m read time)
Table of contents
The Setup
The Explosion
The Fix: Constrain the Manifold
The Results
Why This Matters
Takeaways
What’s Next
Resources
