John Carmack @ID_AA_Carmack

I always lost performance when I tried to use silu/gelu activations in my RL value networks, and I finally understand why. If the pre-activation values are small, the smooth curve through zero is basically a linear activation, destroying the representational power of the network. You need a batch/layer/rms norm on the pre-activations to put them in the range the smooth activations are designed for. Internal norms generally hurt performance on our RL tasks, but combining them with a smooth activation at least works basically as well as a raw relu (but slower). So, not actually a win, but the lightbulb of understanding was good!
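A quick way to see the effect: near zero, silu(x) = x * sigmoid(x) is approximately x/2 + x^2/4, i.e. nearly a scaled linear map, so layers stacked in that regime collapse toward one affine transform. Below is a minimal PyTorch sketch of the point; the tensor shapes, the 0.01 pre-activation scale, and the bare (no learned gain) rms_norm helper are illustrative assumptions, not Carmack's actual setup.

```python
import torch
import torch.nn.functional as F

# Near zero, silu(x) = x * sigmoid(x) ~= x/2 + x^2/4: almost affine,
# so the nonlinearity contributes essentially nothing.
x = torch.linspace(-0.1, 0.1, 5)
print(F.silu(x))          # smooth activation output
print(x / 2 + x**2 / 4)   # second-order Taylor expansion -- nearly identical

# An RMS norm on the pre-activations rescales them to roughly unit
# magnitude, the range where silu actually curves. (Hypothetical helper,
# equivalent to RMSNorm without a learned scale.)
def rms_norm(x, eps=1e-6):
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

pre = 0.01 * torch.randn(4, 64)     # small pre-activations (toy scale)
print(F.silu(pre).std())            # tiny outputs: the near-linear regime
print(F.silu(rms_norm(pre)).std())  # unit-scale regime where the curve matters
```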
