Deepseek just broke the one rule every transformer has followed for a decade 🤯

the residual connection: x + f(x).

if you don't know what that means, here's the simple version: every time a neural network processes your input through a layer, it keeps a copy of the original and adds it back at the end. like a safety net. if the layer screws up, the original signal survives.

gpt-4 uses it. claude uses it. gemini uses it. every major model since 2015 treats it as sacred. nobody touches it.

Deepseek touched it.

instead of 1 stream carrying your data forward, they split it into 4 parallel streams. each stream carries a different aspect of the information, and learned mixing matrices decide how those streams talk to each other at every layer.

more lanes on the highway. smarter traffic control. same computational cost.

sounds perfect on paper. here's where it breaks:
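
the two setups can be sketched in a few lines of numpy. this is a minimal illustration of the idea only: the shapes, the number of streams, and the mixing scheme are my assumptions, not Deepseek's actual code.

```python
import numpy as np

d, n = 8, 4          # hidden width, number of parallel streams (illustrative)
rng = np.random.default_rng(0)

def f(x, W):
    # stand-in for a transformer sublayer (here just a tanh of a linear map)
    return np.tanh(x @ W)

W = rng.normal(scale=0.1, size=(d, d))

# --- classic residual connection: one stream, x + f(x) ---
x = rng.normal(size=(d,))
out_residual = x + f(x, W)          # original signal survives even if f is bad

# --- multi-stream variant: n streams plus a mixing matrix ---
# M decides how streams exchange information at this layer; here it's a
# random near-identity matrix, but in training it would be a learned parameter.
M = np.eye(n) + rng.normal(scale=0.1, size=(n, n))
streams = rng.normal(size=(n, d))

mixed = M @ streams                 # streams talk to each other
h = f(mixed.mean(axis=0), W)        # sublayer reads a combination of streams
streams = mixed + h                 # update is added back to every stream

print(out_residual.shape)           # (8,)  — one lane
print(streams.shape)                # (4, 8) — four lanes, same layer cost scale
```

same spirit as the tweet: the single-stream version keeps one safety net, the multi-stream version keeps four lanes and lets a small matrix direct traffic between them.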
