Activation functions are crucial to preventing a neural network from collapsing into one big layer. Without them, a stack of layers, however deep, would behave as a single linear layer, because a composition of linear maps is itself a linear map. Non-linearity therefore needs to be introduced between layers, and the functions that introduce it are called activation functions. No single best activation function has been found, so different fields and situations keep experimenting to find the one that fits best.
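To make the collapse concrete, here is a minimal NumPy sketch. The shapes, the random weights, and the choice of ReLU as the non-linearity are all assumptions picked for illustration, and biases are left out to keep it short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 4 samples, 8 inputs, 16 hidden units, 3 outputs.
x = rng.normal(size=(4, 8))
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 3))

# Two linear layers with nothing in between...
two_layers = (x @ W1) @ W2
# ...give the same result as one linear layer with weights W1 @ W2.
one_layer = x @ (W1 @ W2)
collapsed = np.allclose(two_layers, one_layer)


def relu(z):
    """Element-wise ReLU, used here only as an example non-linearity."""
    return np.maximum(z, 0.0)


# With a non-linearity between the layers, the single merged matrix
# no longer reproduces the output.
with_activation = relu(x @ W1) @ W2
still_collapsed = np.allclose(with_activation, one_layer)

print(collapsed, still_collapsed)
```

The first check passes because matrix multiplication is associative; the second fails because ReLU zeroes out part of the hidden layer, which no single weight matrix can imitate for every input.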
In the field of transformers, it appears that the previously seldom-used gated linear unit (GLU) is coming to be seen as having high potential. This activation function essentially splits its input into two halves, passes one half through a sigmoid, and multiplies the result component-wise with the other half. This acts like a soft logical gate: the sigmoid output lies between 0 and 1, so some information is allowed to pass and some isn’t. The paper I read today goes through how different variations of this concept might be put to use in transformers for greater accuracy.
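The gating idea can be sketched in a few lines of NumPy. This is only an illustration, not the paper's code: the function names are my own, and which half acts as the gate is a convention I am assuming here (the second half goes through the sigmoid).

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid, squashing values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def glu(x, axis=-1):
    """Gated linear unit over an even-sized axis.

    Assumed convention: the first half carries the values,
    the second half is turned into a gate by the sigmoid.
    """
    values, gate = np.split(x, 2, axis=axis)
    return values * sigmoid(gate)


# One row with four features: values are [1, 2], gate inputs are [0, 100].
x = np.array([[1.0, 2.0, 0.0, 100.0]])
y = glu(x)
print(y)
```

A gate input of 0 gives a sigmoid of 0.5, so the first value passes at half strength, while a large gate input gives a sigmoid close to 1, so the second value passes almost unchanged. Note that the output has half as many features as the input.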
GLU Variants Improve Transformer