# Activation Functions

Written 2023-04-30 — Updated 2023-05-01

- An activation function is simply a function in a layer that acts on the layer's input.
- Assume the input to the function has already been scaled: $x = mx' + b$.
- Sigmoid
  - $y = 1 / (1 + e^{-x})$
  - This is the most commonly taught activation function, but it is rarely seen in real systems, except sometimes at the output stage of a single-output network.
  - The small range of its derivative (at most 0.25) reduces its usefulness, especially as more layers are added.
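To make the derivative problem concrete, here is a minimal Python sketch of the sigmoid and its derivative. The derivative peaks at 0.25 and falls off quickly, which is why gradients shrink as more sigmoid layers are stacked:

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x: float) -> float:
    """Derivative of the sigmoid; at most 0.25, reached at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)
```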

- ReLU
  - Rectified Linear Unit
  - This has been the most widely used activation function in deep learning.
  - $y = \max(x, 0)$
  - Mostly just a passthrough, but output is clamped to non-negative values.
  - Because the function is not smooth at $x = 0$, we fudge the derivative there and define $f'(0)$ to be either 0 or 1.
  - Advantages:
    - Similar to a biological neuron in that it stays inactive until some minimum activation is reached.
    - Fast to compute.
    - No vanishing-gradient problem, since it doesn't saturate for positive inputs.

  - Disadvantages:
    - We can end up with ReLUs that are always inactive. If we get too many of these, it hurts the performance of the network.
    - This is called the “dying ReLU” problem.
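The function and its fudged derivative fit in a couple of lines of Python; this sketch picks 0 as the value of the derivative at $x = 0$, one of the two conventions mentioned above:

```python
def relu(x: float) -> float:
    """Clamp negative inputs to zero; pass positives through unchanged."""
    return max(x, 0.0)

def relu_prime(x: float) -> float:
    """Derivative of ReLU; at the kink x = 0 we define it as 0 by convention."""
    return 1.0 if x > 0 else 0.0
```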

- Leaky ReLU
  - This addresses the dying ReLU problem by using a small nonzero slope for values below 0.
  - $y = x \text{ if } x > 0,\ ax \text{ otherwise}$. Usually $a = 0.01$.
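A sketch of the same piecewise rule in Python, with the usual default of $a = 0.01$:

```python
def leaky_relu(x: float, a: float = 0.01) -> float:
    """Like ReLU, but negative inputs get a small slope a instead of being zeroed."""
    return x if x > 0 else a * x
```

Because the negative side has a nonzero slope, a unit stuck with negative inputs still receives a (small) gradient and can recover.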

- Softplus
  - $y = \log(1 + e^x)$
  - This looks kind of like a ReLU but increases along a smooth curve instead of abruptly changing slope at 0.
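A one-line Python version; `math.log1p` computes $\log(1 + \cdot)$ with better precision than composing `log` and `+ 1` for small arguments:

```python
import math

def softplus(x: float) -> float:
    """Smooth approximation of ReLU: log(1 + e^x)."""
    return math.log1p(math.exp(x))
```

For large positive $x$ the output approaches $x$ itself, matching ReLU's passthrough behavior. (Note that `math.exp` overflows for very large inputs, around $x > 709$; production implementations guard against this.)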

- Gaussian Error Linear Unit (GeLU)
  - $y = 0.5x(1 + \mathrm{erf}(x / \sqrt{2}))$, where $\mathrm{erf}$ is the Gaussian error function.
  - This is the activation used by GPT and BART. It is also a way to solve the dying ReLU problem.
  - Unlike ReLU, and similar to the Leaky ReLU and Softplus, GeLU has a non-zero gradient for negative inputs. Unlike the Leaky ReLU, it is smooth around zero.
  - For large inputs, GeLU approaches the identity function; its slope converges to 1.
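The formula translates directly to Python, since the error function is in the standard library:

```python
import math

def gelu(x: float) -> float:
    """Gaussian Error Linear Unit: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```

For moderately negative inputs the output is small but nonzero (e.g. around $-0.16$ at $x = -1$), which is what keeps the gradient alive where ReLU would be flat.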

- ArgMax
  - The raw output values of a network aren't always between 0 and 1, especially with multiple outputs and more complex networks.
  - ArgMax is a way to simplify interpreting the results: it sets the largest of all the outputs to 1 and the others to 0.
  - The big downside is that we can't backpropagate through an ArgMax.
  - Also, if the two highest outputs are close, we don't see that.
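A sketch of this one-hot interpretation in Python (the name `argmax_onehot` is mine, not a standard API):

```python
def argmax_onehot(outputs: list[float]) -> list[int]:
    """Set the position of the largest output to 1 and all others to 0."""
    best = max(range(len(outputs)), key=outputs.__getitem__)
    return [1 if i == best else 0 for i in range(len(outputs))]
```

Note how the result throws away the margin: `[0.1, 2.3, 2.2]` and `[0.1, 9.0, 0.2]` map to the same one-hot vector.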

- SoftMax
  - Softmax attempts to solve the backpropagation problem faced by ArgMax.
  - $\mathrm{softmax}_n(x) = \dfrac{e^{x_n}}{\sum_{i=1}^{N} e^{x_i}}$
  - This transforms the output values so that each lies between 0 and 1 and they sum to 1.
  - This is kind of like a probability distribution but shouldn't actually be used as one.
  - It amplifies the higher values while suppressing the lower ones.

  - We can then take the derivatives of these values and use them for gradient descent.
  - It's common to use ArgMax when running in production but SoftMax for training.
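The formula above can be sketched in a few lines of Python. Subtracting the maximum before exponentiating is a standard numerical-stability trick (it leaves the result mathematically unchanged but avoids overflow for large inputs); it is not part of the formula itself:

```python
import math

def softmax(outputs: list[float]) -> list[float]:
    """Map raw outputs to positive values in (0, 1) that sum to 1."""
    m = max(outputs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in outputs]
    total = sum(exps)
    return [e / total for e in exps]
```

Unlike ArgMax, this is differentiable everywhere, and close runner-up outputs stay visible as comparably large values instead of being zeroed out.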

Thanks for reading! If you have any questions or comments, please send me a note on Twitter or Mastodon.

Please also consider subscribing to my weekly-ish newsletter, where I write short essays, announce new articles, and share other interesting things I've found.