
Activation Functions

| Function | Formula | Derivative | Range | Where used |
|---|---|---|---|---|
| ReLU | $\max(0,z)$ | $\begin{cases} 1 & \text{if } z>0 \\ 0 & \text{otherwise} \end{cases}$ | $[0,\infty)$ | Hidden layers (CNNs, MLPs) |
| Leaky ReLU | $\begin{cases} z & \text{if } z>0 \\ \alpha z & \text{otherwise} \end{cases}$ | $\begin{cases} 1 & \text{if } z>0 \\ \alpha & \text{otherwise} \end{cases}$ | $(-\infty,\infty)$ | Hidden layers when dead ReLUs are a problem |
| ELU | $\begin{cases} z & \text{if } z>0 \\ \alpha(e^z-1) & \text{otherwise} \end{cases}$ | $\begin{cases} 1 & \text{if } z>0 \\ \alpha e^z & \text{otherwise} \end{cases}$ | $(-\alpha,\infty)$ | Hidden layers; smooth alternative to ReLU; pushes mean activations toward zero |
| GELU | $z \cdot \Phi(z)$ | smooth, no kink | $\approx (-0.17,\infty)$ | Modern transformers (BERT, GPT, ViT) |
| Sigmoid | $\frac{1}{1+e^{-z}}$ | $\sigma(z)\left(1-\sigma(z)\right)$ | $(0,1)$ | Binary classifier output; LSTM gates |
| Tanh | $\frac{e^z-e^{-z}}{e^z+e^{-z}} = 2\sigma(2z)-1$ | $1-\tanh^2(z)$ | $(-1,1)$ | LSTM/GRU hidden state |
| Softmax | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | $p_i-y_i$ for CE loss | $(0,1)^K$ summing to $1$ | Multi-class output; attention |

Notes

ELU Family

Bog-standard ReLU has a ‘kink’ at zero, so it isn’t differentiable there. It also has a “dead node” problem: a node can get stuck outputting zero for all its inputs and stops ‘learning’ (in practice you just set the gradient to zero at the kink and piss off Mathematicians.) Leaky ReLU and ELU try to solve this problem by keeping a nonzero gradient on the negative side. In PyTorch, for example, Leaky ReLU defaults to $\alpha = 0.01$. There’s also Scaled ELU (and I’m sure several other variants.) But whatever: these are extensions of ReLU that keep gradients flowing for negative inputs (ELU is even smooth) and make sure your node’s not dead 💀
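A minimal sketch of the three, in plain Python (the function names are mine; `alpha=0.01` matches PyTorch’s Leaky ReLU default):

```python
import math

def relu(z):
    # zero for negative inputs: this is where nodes can "die"
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    # small linear slope on the negative side keeps the gradient alive
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    # smooth exponential curve on the negative side, saturating at -alpha
    return z if z > 0 else alpha * (math.exp(z) - 1)
```

For $z=-3$: ReLU gives exactly $0$, Leaky ReLU gives $-0.03$, and ELU gives $\approx -0.95$ (approaching its floor of $-\alpha$).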

GELU is a smoother (albeit more complex) variant: $\mathrm{GELU}(z) \approx 0.5z \left( 1 + \tanh\left[ \sqrt{\frac{2}{\pi}} \left( z + 0.044715z^3 \right) \right] \right)$
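You can check that the tanh approximation tracks the exact form $z \cdot \Phi(z)$ closely, since $\Phi(z)$ can be written with the error function. A quick sketch:

```python
import math

def gelu_exact(z):
    # Phi(z) = 0.5 * (1 + erf(z / sqrt(2))), the standard normal CDF
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    # the tanh approximation from the formula above
    return 0.5 * z * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (z + 0.044715 * z**3)))

# the two agree to within ~1e-3 across typical activations
for z in (-2.0, -0.5, 0.0, 1.0, 3.0):
    assert abs(gelu_exact(z) - gelu_tanh(z)) < 1e-3
```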

Oh well. ReLU is widely used because it’s so simple to understand and cheap computationally. Be a ReLU.

The Sigmoid

Shows up in the output layer of Binary Classifiers. You’ll almost never see it in the hidden layers. Why? Because when $|z|$ is very large, the gradients go poof ($\sigma'(z)\approx 0$)

Here’s something cool: when the Sigmoid is the output and Binary Cross-Entropy is the Loss,

$$\frac{\partial L}{\partial z} = \sigma(z) - y$$

The gradient is just “predicted minus actual”!! This is why classification training behaves so nicely.
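You can verify “predicted minus actual” numerically with a finite-difference check (a sketch, using my own helper names):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(z, y):
    # binary cross-entropy applied to the sigmoid output
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# central finite difference of dL/dz vs. the closed form sigmoid(z) - y
z, y, h = 1.3, 1.0, 1e-6
numeric = (bce_loss(z + h, y) - bce_loss(z - h, y)) / (2 * h)
analytic = sigmoid(z) - y
assert abs(numeric - analytic) < 1e-6
```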

In Logistic Regression

Say you have this: $z = \beta_0 + \beta_1 x_1 + \beta_2 x_2$. Just plop the numbers in and get $z$. Say you get $2.4$. You want to know $\hat{p} = P(Y = 1 \mid X = x)$. Logistic Regression will put $z$ through a Sigmoid:

$$\hat{p} = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-2.4}} \approx 0.917$$

That’s all. You have a large positive $z$? Sigmoid will move it close to one. Large negative $z$? Close to zero.
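The arithmetic above in two lines, if you want to sanity-check it:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# the worked example: z = 2.4 gives about 0.917
p_hat = sigmoid(2.4)
assert round(p_hat, 3) == 0.917

# large positive / negative z saturate toward 1 and 0
assert sigmoid(10.0) > 0.9999
assert sigmoid(-10.0) < 0.0001
```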

Tanh

You’ll see this in older RNNs, LSTMs, and GRUs. It’s centered at zero (range is $(-1, 1)$), which is nice when you don’t want all the gradients to have the same sign. Same saturation problem as the Sigmoid for large $|z|$ though…

Vanishing Gradients

Sigmoid and $\tanh$ are nice, but their gradients shrink toward zero quickly as $|z|$ grows. So in gradient descent, the parameters (weight and bias) may barely change between training epochs. Behold the “Vanishing Gradient Problem”.
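You can see how fast the gradient collapses. $\sigma'(z) = \sigma(z)(1-\sigma(z))$ peaks at $0.25$ at $z=0$ and is already tiny by $z=10$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # derivative from the table: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1 - s)

assert sigmoid_grad(0.0) == 0.25   # the maximum possible gradient
assert sigmoid_grad(10.0) < 1e-4   # effectively vanished
```

And since each layer multiplies by such a factor during backprop, stacking a few saturated sigmoid layers shrinks the gradient geometrically.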

SoftMax ❤️

The actual derivative is funky:

$$\frac{\partial \,\mathrm{softmax}(z)_i}{\partial z_j} = \begin{cases} p_i(1-p_i) & i=j \\ -p_i p_j & i\neq j \end{cases}$$

But in Cross-Entropy training, it’s the same energy as the Sigmoid:

$$\frac{\partial L}{\partial z_i} = p_i - y_i$$
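Same finite-difference trick as for the Sigmoid confirms it (a sketch; helper names are mine):

```python
import math

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def ce_loss(zs, ys):
    # cross-entropy against one-hot labels ys
    ps = softmax(zs)
    return -sum(y * math.log(p) for y, p in zip(ys, ps))

# check dL/dz_i == p_i - y_i for each logit
zs, ys, h = [2.0, -1.0, 0.5], [0.0, 1.0, 0.0], 1e-6
ps = softmax(zs)
for i in range(len(zs)):
    bumped, dipped = list(zs), list(zs)
    bumped[i] += h
    dipped[i] -= h
    numeric = (ce_loss(bumped, ys) - ce_loss(dipped, ys)) / (2 * h)
    assert abs(numeric - (ps[i] - ys[i])) < 1e-5
```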

Now this badboy shows up in quite a few places:

  • It’s how you turn the outputs of a multi-class classifier into Probabilities (keep reading)
  • Inside the attention mechanism of Transformers. I’ll update this later when I actually freaking understand how they work…

A Sigmoid is a special case of a SoftMax with just two classes $\left(\frac{1}{1 + e^{-(z_1 - z_2)}}\right)$. How nice is that? As noted, SoftMax is used in MultiClass models (Sigmoids are used in MultiLabel Classifiers.) It looks like this:

$$p_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$

What’s happening there? Notice that it takes all the other classes into account. It ‘moves the Probability Mass’ around, between the classes. It can also give you a nice uniform $p = [0.1, 0.1, \ldots, 0.1]$ over $K=10$ classes, which maxes out Entropy/Surprise/Uncertainty.
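Both claims are easy to check: equal logits come out uniform, and the two-class SoftMax collapses to a Sigmoid of the difference (a sketch):

```python
import math

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# equal logits -> uniform distribution, the maximum-entropy case
ps = softmax([1.0] * 10)
assert all(abs(p - 0.1) < 1e-12 for p in ps)

# two classes: softmax([z1, z2])[0] == sigmoid(z1 - z2)
z1, z2 = 2.0, -1.0
assert abs(softmax([z1, z2])[0] - 1.0 / (1.0 + math.exp(-(z1 - z2)))) < 1e-12
```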

This is also used in Convolutional Neural Networks as the last layer when you want to spit out probabilities (e.g. when you want to classify Pokémon. There’s 800+ of them, so a Pikachu in your amazing CNN might come out as $[0.001, \ldots, \textbf{0.91}, 0.018, 0.033, \ldots, 0.003]$, for example).