Activation Functions
- An activation function is simply a function in a layer that acts on the input.
- Assuming input to the function is already scaled...
- Sigmoid
- This is the most commonly taught activation function, but it's not commonly seen in real systems, except sometimes at the output stage of a single-output network.
- The derivative is small (at most 0.25), which reduces its usefulness: gradients shrink as they pass back through each sigmoid layer, and the problem gets worse as more layers are added.
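- A minimal NumPy sketch of the sigmoid and its derivative; the printed values just illustrate how small the gradient gets:

```python
import numpy as np

def sigmoid(x):
    # squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # derivative is sigmoid(x) * (1 - sigmoid(x)); its maximum is 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_derivative(0.0))  # 0.25 -- the largest the gradient ever gets
print(sigmoid_derivative(5.0))  # ~0.0066 -- shrinks quickly away from 0
```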
- ReLU
- Rectified Linear Unit
- This has been the most-used function in deep learning.
- Mostly just a passthrough: positive inputs pass through unchanged, and negative inputs are clamped to 0.
- Because the function is not smooth at 0, we just fudge the derivative a little bit and say that at x = 0 it is just 0 or 1.
- Advantages:
- Similar to a biological neuron in that it stays inactive until the input crosses some minimum activation threshold.
- Fast to compute
- No problems with the vanishing gradient problem, since it doesn't saturate for positive inputs.
- Disadvantages:
- We can end up with ReLUs that are always inactive. If we get too many of these it hurts the performance of the network.
- This is called the “dying ReLU” problem.
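- A minimal sketch of ReLU and the usual subgradient convention (the choice of 0 at exactly x = 0 here is the arbitrary fudge mentioned above):

```python
import numpy as np

def relu(x):
    # passthrough for positive inputs, clamped to 0 otherwise
    return np.maximum(0.0, x)

def relu_grad(x):
    # not differentiable at exactly 0; by convention we just pick 0 (or 1) there
    return np.where(x > 0, 1.0, 0.0)
```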
- Leaky ReLU
- This addresses the dying ReLU problem by using a slightly negative slope for values below 0.
- Output is x for x > 0 and αx otherwise. Usually α = 0.01.
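- A one-line sketch, using the common default of α = 0.01:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # identical to ReLU for positive inputs, small negative slope (alpha) below 0
    return np.where(x > 0, x, alpha * x)
```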
- Softplus
- This looks kind of like a ReLU but increases with a smooth curve instead of abruptly changing the slope at 0.
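- A minimal sketch; `np.logaddexp` is just a numerically safe way to compute ln(1 + eˣ):

```python
import numpy as np

def softplus(x):
    # softplus(x) = ln(1 + e^x): a smooth curve that approaches ReLU
    # np.logaddexp(0, x) computes the same value without overflowing for large x
    return np.logaddexp(0.0, x)
```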
- Gaussian Error Linear Unit (GeLU)
- GeLU(x) = x · Φ(x) = (x/2)(1 + erf(x/√2)), where erf is the Gaussian error function and Φ is the standard normal CDF.
- This is the activation model used by GPT and BART. Also a way to solve the dying ReLU problem.
- Unlike ReLU, and similar to the Leaky ReLU and SoftPlus, GeLU has a non-zero gradient for negative inputs. Unlike the Leaky ReLU, it is smooth around zero.
- For large positive inputs the GeLU converges to the identity, so its gradient approaches 1.
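- A sketch of the exact form using SciPy's error function (many libraries also ship a faster tanh-based approximation):

```python
import numpy as np
from scipy.special import erf  # the Gaussian error function

def gelu(x):
    # exact GeLU: x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))
```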
- ArgMax
- The raw output values in a network aren’t always between 0 and 1, especially with multiple outputs and more complex networks.
- ArgMax is a way to simplify interpreting the results. It sets the largest value of all the outputs to 1 and the others to 0.
- The big downside is that we can’t backpropagate through ArgMax, since its derivative is 0 almost everywhere.
- Also, if the two highest outputs are close, ArgMax throws that information away.
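- A small illustration with made-up output values, showing how the gap between the top two outputs is lost:

```python
import numpy as np

outputs = np.array([2.3, 4.1, 3.9])   # made-up raw outputs from a 3-class network
hard = np.zeros_like(outputs)
hard[np.argmax(outputs)] = 1.0        # largest output becomes 1, the rest 0
print(hard)                           # [0. 1. 0.] -- the narrow 4.1 vs 3.9 gap is invisible
```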
- SoftMax
- Softmax attempts to solve the backpropagation problem faced by ArgMax.
- This transforms all the output values to a range within 0 to 1, and with the sum of the values equal to 1.
- This is kind of like a probability but shouldn’t actually be used as one.
- It boosts the higher values and suppresses the lower ones, exaggerating the differences between outputs.
- We can then take the derivatives of these values to use them for gradient descent.
- It’s common to use ArgMax for running in production but SoftMax for training.
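- A sketch of softmax applied to the same made-up outputs as the ArgMax example above; subtracting the max before exponentiating is a standard stability trick and doesn't change the result:

```python
import numpy as np

def softmax(x):
    # subtract the max for numerical stability; the result is unchanged
    e = np.exp(x - np.max(x))
    return e / e.sum()

outputs = np.array([2.3, 4.1, 3.9])   # same made-up raw outputs as above
print(softmax(outputs))               # ~[0.08, 0.50, 0.41], sums to 1 and is differentiable
```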