John Carmack @ID_AA_Carmack

I always lost performance when I tried to use silu/gelu activations in my RL value networks, and I finally understand why. If the pre-activation values are small, the smooth curve through zero is basically a linear activation, destroying the representational power of the network. You need a batch/layer/rms norm on the pre-activations to put them in the range the smooth activations are designed for. Internal norms generally hurt performance on our RL tasks, but combining them with a smooth activation at least works basically as well as a raw relu (but slower). So, not actually a win, but the lightbulb of understanding was good!
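A quick way to see the effect: near zero, silu(x) = x * sigmoid(x) is approximately x/2 + x^2/4, i.e. nearly a scaled linear map, so layers stacked in that regime collapse toward one affine transform. Below is a minimal PyTorch sketch of the point; the tensor shapes, the 0.01 pre-activation scale, and the bare (no learned gain) rms_norm helper are illustrative assumptions, not Carmack's actual setup.

```python
import torch
import torch.nn.functional as F

# Near zero, silu(x) = x * sigmoid(x) ~= x/2 + x^2/4: almost affine,
# so the nonlinearity contributes essentially nothing.
x = torch.linspace(-0.1, 0.1, 5)
print(F.silu(x))          # smooth activation output
print(x / 2 + x**2 / 4)   # second-order Taylor expansion -- nearly identical

# An RMS norm on the pre-activations rescales them to roughly unit
# magnitude, the range where silu actually curves. (Hypothetical helper,
# equivalent to RMSNorm without a learned scale.)
def rms_norm(x, eps=1e-6):
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

pre = 0.01 * torch.randn(4, 64)     # small pre-activations (toy scale)
print(F.silu(pre).std())            # tiny outputs: the near-linear regime
print(F.silu(rms_norm(pre)).std())  # unit-scale regime where the curve matters
```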
