Activation Functions
| Function | Formula | Derivative | Range | Where used |
|---|---|---|---|---|
| ReLU | $\max(0, x)$ | $1$ if $x > 0$, else $0$ | $[0, \infty)$ | Hidden layers (CNNs, MLPs) |
| Leaky ReLU | $\max(\alpha x, x)$, small $\alpha$ (e.g. $0.01$) | $1$ if $x > 0$, else $\alpha$ | $(-\infty, \infty)$ | Hidden layers when dead ReLUs are a problem |
| ELU | $x$ if $x > 0$, else $\alpha(e^x - 1)$ | $1$ if $x > 0$, else $\alpha e^x$ | $(-\alpha, \infty)$ | Hidden layers; smooth alternative to ReLU; pushes mean activations toward zero |
| GELU | $x\,\Phi(x)$ ($\Phi$ = standard normal CDF) | smooth, no kink | $\approx (-0.17, \infty)$ | Modern transformers (BERT, GPT, ViT) |
| Sigmoid | $\sigma(x) = \frac{1}{1 + e^{-x}}$ | $\sigma(x)(1 - \sigma(x))$ | $(0, 1)$ | Binary classifier output; LSTM gates |
| Tanh | $\frac{e^x - e^{-x}}{e^x + e^{-x}}$ | $1 - \tanh^2(x)$ | $(-1, 1)$ | LSTM/GRU hidden state |
| Softmax | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | $\hat{y} - y$ for CE loss | $(0, 1)$, summing to $1$ | Multi-class output; attention |
Notes
ELU Family
Bog-standard ReLU has a ‘kink’ at zero, so it isn’t differentiable there. It also has a “dead node” problem: a node can get stuck outputting zero for all its inputs and stop ‘learning’ (in practice you just define the gradient at the kink to be zero and piss off Mathematicians.) Leaky ReLU and ELU try to solve this problem by keeping a small, nonzero gradient for negative inputs. In PyTorch, for example, `nn.LeakyReLU` uses a negative slope of $0.01$ by default. There’s also Scaled ELU (SELU) (and I’m sure several other variants.) But whatever: these are extensions of ReLU that keep the gradient alive (and, in ELU’s case, smooth out the kink) so your node’s not dead 💀
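Here’s a quick PyTorch sketch (mine, just to illustrate — the only assumptions are the library’s default slope/alpha values) showing what happens to the gradient for negative inputs:

```python
# A minimal sketch (PyTorch) of the "dead node" issue and how
# Leaky ReLU / ELU keep a gradient alive for negative inputs.
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -0.5, 0.5, 3.0], requires_grad=True)

for name, fn in [("relu", F.relu),
                 ("leaky_relu", F.leaky_relu),  # default negative_slope=0.01
                 ("elu", F.elu)]:               # default alpha=1.0
    y = fn(x)
    # Gradient of sum(y) w.r.t. x gives the elementwise derivative.
    grad, = torch.autograd.grad(y.sum(), x)
    print(f"{name:10s} y={y.detach().numpy()} dy/dx={grad.numpy()}")

# relu: gradient is exactly 0 for the negative inputs (the node is "dead" there).
# leaky_relu / elu: small but nonzero gradient, so the node keeps learning.
```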
GeLU is a smoother (albeit more complex) variant:

$$\mathrm{GELU}(x) = x\,\Phi(x) \approx 0.5\,x\left(1 + \tanh\!\left[\sqrt{2/\pi}\,\bigl(x + 0.044715\,x^3\bigr)\right]\right)$$

where $\Phi$ is the CDF of the standard normal distribution.
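Quick sanity check (again just a sketch): PyTorch’s exact GELU really is $x\,\Phi(x)$.

```python
# Sanity check: torch's exact GELU matches x * Phi(x),
# where Phi is the standard normal CDF.
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
phi = 0.5 * (1 + torch.erf(x / 2 ** 0.5))  # standard normal CDF
print(torch.allclose(F.gelu(x), x * phi))  # True
```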
Oh well. ReLU is widely used because it’s so simple to understand and cheap computationally. Be a ReLU.
The Sigmoid
Shows up in the output layer of Binary Classifiers. You’ll almost never see it in the hidden layers. Why? Because when $|x|$ is very large, the gradients go poof ($\sigma'(x) = \sigma(x)(1 - \sigma(x)) \to 0$).
Here’s something cool: when the Sigmoid is the output and Binary Cross-Entropy is the Loss, the gradient of the loss with respect to the logit $z$ is

$$\frac{\partial \mathcal{L}}{\partial z} = \hat{y} - y$$

The gradient is just “predicted minus actual”!! This is why classification behaves nicely.
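You can make autograd confirm it. A tiny sketch (made-up logits and labels):

```python
# Autograd check: with a sigmoid output and binary cross-entropy,
# the gradient w.r.t. the logit z is sigmoid(z) - y.
import torch
import torch.nn.functional as F

z = torch.tensor([2.0, -1.0, 0.5], requires_grad=True)  # raw logits
y = torch.tensor([1.0, 0.0, 1.0])                       # true labels

loss = F.binary_cross_entropy_with_logits(z, y, reduction="sum")
loss.backward()

print(z.grad)                          # autograd's answer
print((torch.sigmoid(z) - y).detach()) # "predicted minus actual" -- same numbers
```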
In Logistic Regression
Say you have this: $z = w^\top x + b$. Just plop the numbers in and get a raw score $z$. You want to know $P(y = 1 \mid x)$. Logistic Regression will put $z$ through a Sigmoid:

$$P(y = 1 \mid x) = \sigma(z) = \frac{1}{1 + e^{-z}}$$
That’s all. You have a large positive $z$? Sigmoid will move it close to one. Large negative $z$? Close to zero.
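In code (a toy sketch, the weights and inputs are made up), that’s all there is to it:

```python
# Toy sketch of logistic regression's forward pass (hypothetical numbers).
import torch

w = torch.tensor([0.8, -0.3, 1.2])  # learned weights (hypothetical)
b = torch.tensor(0.5)               # learned bias (hypothetical)
x = torch.tensor([1.0, 2.0, 0.5])   # one input example

z = w @ x + b                       # raw score
p = torch.sigmoid(z)                # P(y = 1 | x)
print(z.item(), p.item())           # a positive z pushes p above 0.5
```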
Tanh
You’ll see this in older RNNs, LSTMs, and GRUs. It’s centered at zero (range is $(-1, 1)$) which is nice when you don’t want all the gradients to have the same sign. Same problem as Sigmoids with large $|x|$ though…
Sigmoid and Tanh are nice but their gradients go to very small values very quickly. So in gradient descent, the parameters (weights and biases) may barely change between training epochs. Behold the “Vanishing Gradient Problem”.
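Here’s a little sketch of how fast those derivatives collapse:

```python
# How quickly the sigmoid/tanh derivatives vanish as x grows.
import torch

xs = torch.tensor([0.0, 2.0, 5.0, 10.0])

x = xs.clone().requires_grad_()
d_sigmoid, = torch.autograd.grad(torch.sigmoid(x).sum(), x)

x = xs.clone().requires_grad_()
d_tanh, = torch.autograd.grad(torch.tanh(x).sum(), x)

print(d_sigmoid)  # roughly [0.25, 0.105, 0.0066, 0.000045]
print(d_tanh)     # roughly [1.0, 0.071, 0.00018, ~1e-8]
```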
SoftMax ❤️
The actual derivative is funky:

$$\frac{\partial s_i}{\partial z_j} = s_i\left(\delta_{ij} - s_j\right), \quad \text{where } s = \mathrm{softmax}(z)$$
But in Cross-Entropy training, it’s the same energy as the Sigmoid:

$$\frac{\partial \mathcal{L}}{\partial z_i} = \hat{y}_i - y_i$$
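Same autograd sanity check as before (a sketch with made-up logits and a made-up label):

```python
# Autograd check: with softmax + cross-entropy, the gradient
# w.r.t. the logits is softmax(z) - one_hot(y).
import torch
import torch.nn.functional as F

z = torch.tensor([[2.0, -1.0, 0.3]], requires_grad=True)  # 1 sample, 3 classes
y = torch.tensor([0])                                     # true class index

loss = F.cross_entropy(z, y, reduction="sum")  # applies log-softmax internally
loss.backward()

print(z.grad)                                          # autograd's answer
print(F.softmax(z, dim=1).detach() - F.one_hot(y, 3))  # "predicted minus actual"
```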
Now this badboy shows up in quite a few places:
- It’s how you turn the outputs of a multi-class classifier into Probabilities (keep reading)
- Inside the attention mechanism of Transformers. I’ll update this later when I actually freaking understand how they work…
A Sigmoid is a special case of a SoftMax with just two classes: $\mathrm{softmax}([z, 0])_1 = \frac{e^z}{e^z + e^0} = \frac{1}{1 + e^{-z}} = \sigma(z)$. How nice is that? As noted, SoftMax is used in MultiClass models (Sigmoids are used in MultiLabel Classifiers.) It looks like this:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
What’s happening there? Notice that it takes all the other classes into account. It ‘moves the Probability Mass’ around, between the classes. It can also give you a nice uniform distribution ($1/K$ for every class), which maxes out Entropy/Surprise/Uncertainty.
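And here’s a sketch of that two-class claim from above — softmax over the logits $[z, 0]$ gives you back the Sigmoid:

```python
# Softmax over the two logits [z, 0] reproduces sigmoid(z).
import torch
import torch.nn.functional as F

z = torch.tensor([2.3, -0.7, 0.0])
two_class = torch.stack([z, torch.zeros_like(z)], dim=1)  # logits [z, 0] per row
p_class0 = F.softmax(two_class, dim=1)[:, 0]              # probability of class 0
print(torch.allclose(p_class0, torch.sigmoid(z)))         # True
```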
This is also used in Convolutional Neural Networks as the last layer when you want to spit out probabilities (e.g. when you want to classify Pokemons. There are ~800 of them, so a Pikachu in your amazing CNN would be, for example, an 800-long probability vector with something close to $1$ at the Pikachu index and crumbs everywhere else.)
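Something like this (a toy sketch, with random logits standing in for a real CNN and a hypothetical 800-class Pokémon setup):

```python
# Toy sketch of that final layer: 800 raw scores in, a probability
# distribution over 800 (hypothetical) Pokémon classes out.
import torch
import torch.nn.functional as F

logits = torch.randn(1, 800)      # pretend these came out of your CNN
probs = F.softmax(logits, dim=1)  # 800 probabilities summing to 1

print(probs.sum())                # tensor(1.0000)
print(probs.argmax(dim=1))        # index of the most likely Pokémon
```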