Activation Functions
| Function | Formula | Derivative | Range | Where used |
|---|---|---|---|---|
| ReLU | $\max(0, x)$ | $1$ if $x > 0$, else $0$ | $[0, \infty)$ | Hidden layers (CNNs, MLPs) |
| Leaky ReLU | $\max(\alpha x, x)$, small $\alpha$ (e.g. $0.01$) | $1$ if $x > 0$, else $\alpha$ | $(-\infty, \infty)$ | Hidden layers when dead ReLUs are a problem |
| ELU | $x$ if $x > 0$, else $\alpha(e^x - 1)$ | $1$ if $x > 0$, else $\alpha e^x$ | $(-\alpha, \infty)$ | Hidden layers; smooth alternative to ReLU; pushes mean activations toward zero |
| GELU | $x\,\Phi(x)$ ($\Phi$ = standard normal CDF) | smooth, no kink | $\approx (-0.17, \infty)$ | Modern transformers (BERT, GPT, ViT) |
| Sigmoid | $\sigma(x) = \frac{1}{1 + e^{-x}}$ | $\sigma(x)(1 - \sigma(x))$ | $(0, 1)$ | Binary classifier output; LSTM gates |
| Tanh | $\frac{e^x - e^{-x}}{e^x + e^{-x}}$ | $1 - \tanh^2(x)$ | $(-1, 1)$ | LSTM/GRU hidden state |
| Softmax | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | $\hat{y} - y$ for CE loss | $(0, 1)$, summing to $1$ | Multi-class output; attention |
Notes
ELU Family
Bog-standard ReLU has a ‘kink’ at zero, so it isn’t differentiable there. It also has a “dead node” problem: a node can get stuck outputting zero for all its inputs and stop ‘learning’ (in practice you just define the gradient at the kink to be zero and piss off Mathematicians.) Leaky ReLU and ELU try to solve this problem by keeping a small, nonzero gradient for negative inputs. In PyTorch, for example, `nn.LeakyReLU` uses a negative slope of $0.01$ by default. There’s also Scaled ELU (SELU) (and I’m sure several other variants.) But whatever: these are extensions of ReLU that keep the gradient alive (and, in ELU’s case, smooth out the kink) so your node’s not dead 💀
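Here’s a quick PyTorch sketch (mine, just to illustrate — the only assumptions are the library’s default slope/alpha values) showing what happens to the gradient for negative inputs:

```python
# A minimal sketch (PyTorch) of the "dead node" issue and how
# Leaky ReLU / ELU keep a gradient alive for negative inputs.
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -0.5, 0.5, 3.0], requires_grad=True)

for name, fn in [("relu", F.relu),
                 ("leaky_relu", F.leaky_relu),  # default negative_slope=0.01
                 ("elu", F.elu)]:               # default alpha=1.0
    y = fn(x)
    # Gradient of sum(y) w.r.t. x gives the elementwise derivative.
    grad, = torch.autograd.grad(y.sum(), x)
    print(f"{name:10s} y={y.detach().numpy()} dy/dx={grad.numpy()}")

# relu: gradient is exactly 0 for the negative inputs (the node is "dead" there).
# leaky_relu / elu: small but nonzero gradient, so the node keeps learning.
```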
GeLU is a smoother (albeit more complex) variant:

$$\mathrm{GELU}(x) = x\,\Phi(x) \approx 0.5\,x\left(1 + \tanh\!\left[\sqrt{2/\pi}\,\bigl(x + 0.044715\,x^3\bigr)\right]\right)$$

where $\Phi$ is the CDF of the standard normal distribution.
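Quick sanity check (again just a sketch): PyTorch’s exact GELU really is $x\,\Phi(x)$.

```python
# Sanity check: torch's exact GELU matches x * Phi(x),
# where Phi is the standard normal CDF.
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
phi = 0.5 * (1 + torch.erf(x / 2 ** 0.5))  # standard normal CDF
print(torch.allclose(F.gelu(x), x * phi))  # True
```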
Oh well. ReLU is widely used because it’s so simple to understand and cheap computationally. Be a ReLU.
The Sigmoid
Shows up in the output layer of Binary Classifiers. You’ll almost never see it in the hidden layers. Why? Because when $|x|$ is very large, the gradients go poof ($\sigma'(x) = \sigma(x)(1 - \sigma(x)) \to 0$).
Here’s something cool: when the Sigmoid is the output and Binary Cross-Entropy is the Loss, the gradient of the loss with respect to the logit $z$ is

$$\frac{\partial \mathcal{L}}{\partial z} = \hat{y} - y$$

The gradient is just “predicted minus actual”!! This is why classification behaves nicely.
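You can make autograd confirm it. A tiny sketch (made-up logits and labels):

```python
# Autograd check: with a sigmoid output and binary cross-entropy,
# the gradient w.r.t. the logit z is sigmoid(z) - y.
import torch
import torch.nn.functional as F

z = torch.tensor([2.0, -1.0, 0.5], requires_grad=True)  # raw logits
y = torch.tensor([1.0, 0.0, 1.0])                       # true labels

loss = F.binary_cross_entropy_with_logits(z, y, reduction="sum")
loss.backward()

print(z.grad)                          # autograd's answer
print((torch.sigmoid(z) - y).detach()) # "predicted minus actual" -- same numbers
```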
In Logistic Regression
Say you have this: $z = w^\top x + b$. Just plop the numbers in and get a raw score $z$. You want to know $P(y = 1 \mid x)$. Logistic Regression will put $z$ through a Sigmoid:

$$P(y = 1 \mid x) = \sigma(z) = \frac{1}{1 + e^{-z}}$$
That’s all. You have a large positive $z$? Sigmoid will move it close to one. Large negative $z$? Close to zero.
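In code (a toy sketch, the weights and inputs are made up), that’s all there is to it:

```python
# Toy sketch of logistic regression's forward pass (hypothetical numbers).
import torch

w = torch.tensor([0.8, -0.3, 1.2])  # learned weights (hypothetical)
b = torch.tensor(0.5)               # learned bias (hypothetical)
x = torch.tensor([1.0, 2.0, 0.5])   # one input example

z = w @ x + b                       # raw score
p = torch.sigmoid(z)                # P(y = 1 | x)
print(z.item(), p.item())           # a positive z pushes p above 0.5
```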
Tanh
You’ll see this in older RNNs, LSTMs, and GRUs. It’s centered at zero (range is $(-1, 1)$) which is nice when you don’t want all the gradients to have the same sign. Same problem as Sigmoids with large $|x|$ though…
Sigmoid and Tanh are nice but their gradients go to very small values very quickly. So in gradient descent, the parameters (weights and biases) may barely change between training epochs. Behold the “Vanishing Gradient Problem”.
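Here’s a little sketch of how fast those derivatives collapse:

```python
# How quickly the sigmoid/tanh derivatives vanish as x grows.
import torch

xs = torch.tensor([0.0, 2.0, 5.0, 10.0])

x = xs.clone().requires_grad_()
d_sigmoid, = torch.autograd.grad(torch.sigmoid(x).sum(), x)

x = xs.clone().requires_grad_()
d_tanh, = torch.autograd.grad(torch.tanh(x).sum(), x)

print(d_sigmoid)  # roughly [0.25, 0.105, 0.0066, 0.000045]
print(d_tanh)     # roughly [1.0, 0.071, 0.00018, ~1e-8]
```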
SoftMax ❤️
The actual derivative is funky:

$$\frac{\partial s_i}{\partial z_j} = s_i\left(\delta_{ij} - s_j\right), \quad \text{where } s = \mathrm{softmax}(z)$$
But in Cross-Entropy training, it’s the same energy as the Sigmoid:

$$\frac{\partial \mathcal{L}}{\partial z_i} = \hat{y}_i - y_i$$
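Same autograd sanity check as before (a sketch with made-up logits and a made-up label):

```python
# Autograd check: with softmax + cross-entropy, the gradient
# w.r.t. the logits is softmax(z) - one_hot(y).
import torch
import torch.nn.functional as F

z = torch.tensor([[2.0, -1.0, 0.3]], requires_grad=True)  # 1 sample, 3 classes
y = torch.tensor([0])                                     # true class index

loss = F.cross_entropy(z, y, reduction="sum")  # applies log-softmax internally
loss.backward()

print(z.grad)                                          # autograd's answer
print(F.softmax(z, dim=1).detach() - F.one_hot(y, 3))  # "predicted minus actual"
```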
Now this badboy shows up in quite a few places:
- It’s how you turn the outputs of a multi-class classifier into Probabilities (keep reading)
- Inside the attention mechanism of Transformers. I’ll update this later when I actually freaking understand how they work…
A Sigmoid is a special case of a SoftMax with just two classes: $\mathrm{softmax}([z, 0])_1 = \frac{e^z}{e^z + e^0} = \frac{1}{1 + e^{-z}} = \sigma(z)$. How nice is that? As noted, SoftMax is used in MultiClass models (Sigmoids are used in MultiLabel Classifiers.) It looks like this:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
What’s happening there? Notice that it takes all the other classes into account. It ‘moves the Probability Mass’ around, between the classes. It can also give you a nice uniform distribution ($1/K$ for every class), which maxes out Entropy/Surprise/Uncertainty.
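And here’s a sketch of that two-class claim from above — softmax over the logits $[z, 0]$ gives you back the Sigmoid:

```python
# Softmax over the two logits [z, 0] reproduces sigmoid(z).
import torch
import torch.nn.functional as F

z = torch.tensor([2.3, -0.7, 0.0])
two_class = torch.stack([z, torch.zeros_like(z)], dim=1)  # logits [z, 0] per row
p_class0 = F.softmax(two_class, dim=1)[:, 0]              # probability of class 0
print(torch.allclose(p_class0, torch.sigmoid(z)))         # True
```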
This is also used in Convolutional Neural Networks as the last layer when you want to spit out probabilities (e.g. when you want to classify Pokemons. There are ~800 of them, so a Pikachu in your amazing CNN would be, for example, an 800-long probability vector with something close to $1$ at the Pikachu index and crumbs everywhere else.)
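Something like this (a toy sketch, with random logits standing in for a real CNN and a hypothetical 800-class Pokémon setup):

```python
# Toy sketch of that final layer: 800 raw scores in, a probability
# distribution over 800 (hypothetical) Pokémon classes out.
import torch
import torch.nn.functional as F

logits = torch.randn(1, 800)      # pretend these came out of your CNN
probs = F.softmax(logits, dim=1)  # 800 probabilities summing to 1

print(probs.sum())                # tensor(1.0000)
print(probs.argmax(dim=1))        # index of the most likely Pokémon
```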