Activation Functions

Written 2023-04-30 — Updated 2023-05-01

An activation function is a function, usually nonlinear, that a layer applies to its input to produce its output.
The formulas below assume the input to the function has already been scaled by a weight and shifted by a bias: $x = mx' + b$
Sigmoid
- $1 / (1 + e^{-x})$
- This is the most commonly taught activation function, but it's not commonly seen in real systems, except sometimes at the output stage of a single-output network.
- The derivative never exceeds 0.25 and approaches 0 for large $|x|$, so gradients shrink as they are multiplied back through more layers. This limits its usefulness in deep networks.
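
The point about the derivative can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `sigmoid` and `sigmoid_grad` are illustrative, and the 10-layer figure ignores the weights, so it is only a rough bound on the activation's own contribution.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + e^{-x}), branched to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative is largest at x = 0, where it equals 0.25.
peak = sigmoid_grad(0.0)

# Best-case gradient factor contributed by 10 stacked sigmoids (weights ignored).
bound_10_layers = peak ** 10
```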
ReLU
- Rectified Linear Unit
- This has been the most-used function in deep learning.
- $y = \max(x, 0)$
- Mostly just a passthrough, but negative inputs are clamped to 0, so the output is never negative.
- Because the function is not differentiable at $x = 0$, we fudge the derivative a little and define $f'(0)$ to be either 0 or 1.
- Advantages:
  - Similar to a biological neuron in that the output stays fixed until the input reaches some minimum activation.
  - Fast to compute.
  - Avoids the vanishing gradient problem for positive inputs, since the function doesn't saturate there.
- Disadvantages:
  - We can end up with ReLUs that are always inactive. Because the gradient is 0 for negative inputs, such units stop updating. If we get too many of these, it hurts the performance of the network.
    - This is called the “dying ReLU” problem.
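
A minimal sketch of ReLU and its derivative, assuming the convention that the derivative at 0 is 0. The names `relu` and `relu_grad` and the sample pre-activation values are made up for illustration. The last lines show a unit whose inputs are all negative: both its output and its gradient are 0 everywhere, which is the inactive case described above.

```python
def relu(x: float) -> float:
    """Pass positive inputs through unchanged; clamp everything else to 0."""
    return max(x, 0.0)

def relu_grad(x: float) -> float:
    """Derivative of ReLU, using the convention f'(0) = 0."""
    return 1.0 if x > 0 else 0.0

# Hypothetical pre-activations for a unit that only ever sees negative inputs.
pre_activations = [-3.0, -0.5, -1.2]
outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]
```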
Leaky ReLU
- This addresses the dying ReLU problem by using a small positive slope, instead of a flat 0, for inputs below 0.
- Output is $y = x \text{ if } x > 0$ and $y = ax \text{ if } x \le 0$. Usually $a = 0.01$.
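
A minimal sketch of the Leaky ReLU, writing the input as `x` and the slope for non-positive inputs as `a`. The function names are illustrative. The gradient for negative inputs is `a` rather than 0, so the unit can keep learning.

```python
def leaky_relu(x: float, a: float = 0.01) -> float:
    """Pass positive inputs through; scale non-positive inputs by the small slope a."""
    return x if x > 0 else a * x

def leaky_relu_grad(x: float, a: float = 0.01) -> float:
    """Derivative: 1 for positive inputs, a otherwise (convention at x = 0)."""
    return 1.0 if x > 0 else a
```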
Softplus
- $y = \log(1 + e^x)$
- This looks kind of like a ReLU but increases with a smooth curve instead of abruptly changing the slope at 0.
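
A minimal sketch of Softplus using the standard library. Evaluating $e^x$ directly overflows for large inputs, so this version uses the algebraically equivalent form $\max(x, 0) + \log(1 + e^{-|x|})$. That rearrangement is a choice made for this example, not something the formula above requires.

```python
import math

def softplus(x: float) -> float:
    """log(1 + e^x), computed as max(x, 0) + log1p(e^{-|x|}) to avoid overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Far from 0 the curve hugs ReLU; near 0 it bends smoothly instead of kinking.
at_zero = softplus(0.0)        # log(2)
far_positive = softplus(50.0)  # very close to 50
far_negative = softplus(-50.0) # very close to 0, but still positive
```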
Gaussian Error Linear Unit (GeLU)
- $y = 0.5 \, x \, (1 + \text{erf}(x / \sqrt{2}))$ where $\text{erf}$ is the Gaussian error function.
- This is the activation function used by GPT and BART. It is also another way to address the dying ReLU problem.
- Unlike ReLU, and similar to the Leaky ReLU and SoftPlus, GeLU has a non-zero gradient for negative inputs. Unlike the Leaky ReLU, it is smooth around zero.
- For large positive inputs the GeLU approaches $y = x$, so its gradient converges to 1, just like ReLU.
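
A minimal sketch of the exact (erf-based) GeLU and its derivative, using `math.erf` from the standard library. The derivative is written as $\Phi(x) + x\,\phi(x)$, where $\Phi$ and $\phi$ are the standard normal CDF and PDF; this follows from $y = x\,\Phi(x)$. The function names are illustrative.

```python
import math

def gelu(x: float) -> float:
    """Exact GeLU: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_grad(x: float) -> float:
    """Derivative of GeLU: Phi(x) + x * phi(x) for the standard normal CDF and PDF."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

# Gradient for a negative input is non-zero, unlike ReLU.
grad_negative = gelu_grad(-1.0)

# For a large positive input the output tracks x and the gradient is close to 1.
out_large = gelu(10.0)
grad_large = gelu_grad(10.0)
```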
ArgMax
- The raw output values in a network aren’t always between 0 and 1, especially with multiple outputs and more complex networks.
- ArgMax is a way to simplify interpreting the results. It sets the largest value of all the outputs to 1 and the others to 0.
  - The big downside is that we can’t backpropagate through an ArgMax: its derivative is 0 wherever it is defined, so no gradient reaches the earlier layers.
  - Also, if the two highest outputs are close, that information is lost.
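
A minimal sketch of ArgMax as described here, returning 1 for the largest output and 0 for the rest. The function name `argmax_one_hot`, the sample values, and the tie-breaking rule (first index wins) are assumptions made for this example. The second call shows how a near-tie produces exactly the same result as a clear winner.

```python
from typing import List, Sequence

def argmax_one_hot(outputs: Sequence[float]) -> List[float]:
    """Set the largest output to 1 and all others to 0 (first index wins ties)."""
    winner = max(range(len(outputs)), key=lambda i: outputs[i])
    return [1.0 if i == winner else 0.0 for i in range(len(outputs))]

clear = argmax_one_hot([0.2, 3.1, -1.0])   # one output clearly dominates
close = argmax_one_hot([2.99, 3.0, -1.0])  # top two outputs nearly equal
```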
SoftMax
- Softmax attempts to solve the backpropagation problem faced by ArgMax.
- $\text{softmax}_n(\text{outputs}) = \frac{e^{x_n}}{\sum_{i=1}^N e^{x_i}}$, where $x_i$ is the $i$-th of the $N$ raw outputs.
- This maps every output value into the range 0 to 1, with the values summing to 1.
  - These values resemble probabilities but shouldn’t actually be treated as such.
  - It boosts the higher values relative to the rest and suppresses the lower ones.
- Because SoftMax is differentiable, we can take the derivatives of these values and use them for gradient descent.
- It’s common to use ArgMax for running in production but SoftMax for training.
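
A minimal sketch of SoftMax and its derivative, using only the standard library. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result; it is an addition for this example rather than part of the formula above. The Jacobian uses the standard identity $\partial\,\text{softmax}_i / \partial x_j = p_i(\delta_{ij} - p_j)$. Function names and sample values are illustrative.

```python
import math
from typing import List, Sequence

def softmax(outputs: Sequence[float]) -> List[float]:
    """e^{x_n} / sum_i e^{x_i}, with the max subtracted first to avoid overflow."""
    m = max(outputs)
    exps = [math.exp(x - m) for x in outputs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(probs: Sequence[float]) -> List[List[float]]:
    """Matrix of partial derivatives: J[i][j] = p_i * (delta_ij - p_j)."""
    n = len(probs)
    return [
        [probs[i] * ((1.0 if i == j else 0.0) - probs[j]) for j in range(n)]
        for i in range(n)
    ]

# Unlike ArgMax, a near-tie between the top two outputs stays visible.
probs = softmax([2.99, 3.0, -1.0])
jac = softmax_jacobian(probs)
```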