Residual connections have been a cornerstone of deep learning since ResNets, enabling gradient flow and stable training of deep models. Hyper-Connections (HC) extend this by expanding the input features into multiple parallel residual streams and using learnable aggregation, expansion, and feature-mixing matrices to route information dynamically across streams. This is reported to yield up to 1.8x faster convergence than standard residuals. However, DeepSeek found that unconstrained feature-mixing matrices cause training instability: because the mixing matrices are multiplied together layer after layer, gains slightly above or below 1 compound exponentially with depth. Their fix, Manifold-Constrained Hyper-Connections (mHC), constrains the mixing matrix to be doubly stochastic (all elements non-negative, with every row and every column summing to 1) via the Sinkhorn algorithm. DeepSeek also switched the activation function from tanh to sigmoid and added a scalar factor of 2 so that hyper-connections initialize identically to standard residuals. Three infrastructure optimizations (fused kernels, activation recomputation, and pipeline overlapping) keep the training overhead to just 6.7% at an expansion rate of 4.
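The effect of the doubly stochastic constraint can be illustrated with a minimal sketch in plain Python. It is not DeepSeek's implementation: the `sinkhorn` and `matmul` helpers, the exponential used to make entries positive, the iteration count, the random logits, and the depth of 60 layers are all assumptions chosen for illustration. The sketch alternates row and column normalization (the Sinkhorn iteration), then multiplies one mixing matrix per layer to show that the product still has row and column sums close to 1, whereas a scalar gain of 1.05 or 0.95 grows or shrinks steadily with depth.

```python
import math
import random


def sinkhorn(logits, n_iters=30):
    """Approximately map a square matrix of logits to a doubly stochastic matrix."""
    n = len(logits)
    # Exponentiate so every entry is strictly positive (an illustrative choice).
    m = [[math.exp(x) for x in row] for row in logits]
    for _ in range(n_iters):
        # Normalize each row so it sums to 1.
        m = [[x / sum(row) for x in row] for row in m]
        # Normalize each column so it sums to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def matmul(a, b):
    """Multiply two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def random_logits(n, rng):
    """Hypothetical stand-in for learned mixing parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]


rng = random.Random(0)
n_streams = 4   # expansion rate of 4, as in the summary
depth = 60      # hypothetical number of layers

# Scalar intuition: a per-layer gain slightly off 1 compounds with depth.
grow = 1.05 ** depth
shrink = 0.95 ** depth

# Product of one doubly stochastic mixing matrix per layer.
product = sinkhorn(random_logits(n_streams, rng))
for _ in range(depth - 1):
    product = matmul(sinkhorn(random_logits(n_streams, rng)), product)

row_sums = [sum(row) for row in product]
col_sums = [sum(product[i][j] for i in range(n_streams)) for j in range(n_streams)]

print("1.05**depth:", grow, " 0.95**depth:", shrink)
print("row sums of product:", row_sums)
print("column sums of product:", col_sums)
```

The product stays well behaved because the set of doubly stochastic matrices is closed under multiplication: each such matrix only averages the streams, so stacking layers cannot amplify or attenuate the total signal.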

13m watch time
