# Activation Functions

• An activation function is a function applied within a layer to transform its input; it's what introduces nonlinearity into the network.
• Assume the input to the function has already been scaled and shifted by the layer's weights and bias: $x = mx' + b$.
• Sigmoid
• $\sigma(x) = \frac{1}{1 + e^{-x}}$
• This is the most commonly taught activation function, but it's not commonly seen in real systems, except sometimes at the output stage of a single-output network.
• Its derivative is small everywhere (at most 0.25) and vanishes for large $|x|$, which reduces its usefulness, especially as more layers are added (see the sketch below).
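A minimal NumPy sketch of the sigmoid and its derivative (function names are my own), showing how small the gradient stays:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative sigma(x) * (1 - sigma(x)); peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x))       # [0.0067 0.5    0.9933]
print(sigmoid_grad(x))  # [0.0066 0.25   0.0066]  <- never above 0.25
```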
• ReLU
• Rectified Linear Unit
• This has been the most-used function in deep learning.
• $y = \max(x, 0)$
• Mostly just a passthrough, but negative inputs are clamped to zero.
• Because the function is not smooth at $x = 0$, we just fudge the derivative a little and define $f'(0)$ as either 0 or 1.
• Similar to a biological neuron in that it stays inactive until the input crosses a minimum activation threshold.
• Fast to compute
• Avoids the vanishing gradient problem for positive inputs, since it doesn't saturate there.
• We can end up with ReLUs that are always inactive: a unit whose input is always negative gets zero gradient and stops learning. If we get too many of these it hurts the performance of the network (see the sketch below).
• This is called the “dying ReLU” problem.
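A sketch of ReLU and the conventional subgradient (assuming the $f'(0) = 0$ choice); a unit whose inputs are always negative returns zero gradient forever, which is exactly the dying-ReLU failure:

```python
import numpy as np

def relu(x):
    """max(x, 0): pass positives through, clamp negatives to zero."""
    return np.maximum(x, 0.0)

def relu_grad(x):
    """Subgradient: 1 for x > 0, else 0 (the non-smooth point x = 0 gets 0)."""
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(relu_grad(x))  # [0. 0. 1.]  <- negative inputs learn nothing
```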
• Leaky ReLU
• This addresses the dying ReLU problem by using a small nonzero slope for inputs below 0, so negative inputs still produce a gradient.
• Output is $y = x$ for $x > 0$ and $y = ax$ for $x \le 0$; usually $a = 0.01$ (see the sketch below).
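The same sketch with the leak added; the nonzero slope on the negative side means the gradient never goes fully dead:

```python
import numpy as np

def leaky_relu(x, a=0.01):
    """x for x > 0, a * x otherwise; a = 0.01 is the common default."""
    return np.where(x > 0, x, a * x)

def leaky_relu_grad(x, a=0.01):
    """1 for x > 0, a otherwise -- never exactly zero."""
    return np.where(x > 0, 1.0, a)

x = np.array([-2.0, 0.0, 3.0])
print(leaky_relu(x))       # [-0.02  0.    3.  ]
print(leaky_relu_grad(x))  # [0.01 0.01 1.  ]
```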
• Softplus
• $y = \log(1 + e^x)$
• This looks kind of like a ReLU but increases with a smooth curve instead of abruptly changing the slope at 0.
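A handy fact that follows from the formula: the derivative of softplus is exactly the sigmoid, $\frac{d}{dx}\log(1 + e^x) = \frac{e^x}{1 + e^x} = \sigma(x)$. A small sketch of the naive form (a production version would guard against overflow for large $x$):

```python
import numpy as np

def softplus(x):
    """log(1 + e^x): a smooth approximation of ReLU."""
    return np.log1p(np.exp(x))  # naive; np.exp overflows for very large x

x = np.array([-5.0, 0.0, 5.0])
print(softplus(x))  # [0.0067 0.6931 5.0067]  <- tracks ReLU away from 0
```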
• Gaussian Error Linear Unit (GeLU)
• $y = 0.5\,x\,(1 + \operatorname{erf}(x / \sqrt{2}))$, where $\operatorname{erf}$ is the Gaussian error function.
• This is the activation function used by GPT and BART. Also a way to solve the dying ReLU problem.
• Unlike ReLU, and similar to the Leaky ReLU and SoftPlus, GeLU has a non-zero gradient for negative inputs. Unlike the Leaky ReLU, it is smooth around zero.
• For large positive inputs the Gaussian CDF factor converges to 1, so the GeLU converges to the identity, $y \approx x$ (see the sketch below).
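A scalar sketch using Python's built-in `math.erf`; the comments show how the Gaussian CDF factor behaves at the extremes:

```python
import math

def gelu(x):
    """0.5 * x * (1 + erf(x / sqrt(2))) = x * Phi(x), the Gaussian CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-3.0, -0.5, 0.5, 3.0):
    print(x, round(gelu(x), 4))
# -3.0 -0.004   <- small but nonzero, unlike ReLU
# -0.5 -0.1543
#  0.5  0.3457
#  3.0  2.996   <- multiplier near 1, so output is nearly x
```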
• ArgMax
• The raw output values in a network aren’t always between 0 and 1, especially with multiple outputs and more complex networks.
• ArgMax is a way to simplify interpreting the results. It sets the largest value of all the outputs to 1 and the others to 0.
• The big downside is that we can't backpropagate through an ArgMax: its gradient is zero almost everywhere.
• Also, if the two highest outputs are close, the one-hot result hides that entirely (see the sketch below).
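A sketch of the problem with some hypothetical raw outputs; the near-tie between the top two values disappears in the one-hot result:

```python
import numpy as np

outputs = np.array([1.2, 4.8, 4.7])  # hypothetical raw network outputs

one_hot = np.zeros_like(outputs)
one_hot[np.argmax(outputs)] = 1.0
print(one_hot)  # [0. 1. 0.]  <- no hint that 4.7 was a close second
```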
• SoftMax
• Softmax attempts to solve the backpropagation problem faced by ArgMax.
• $\mathrm{softmax}(x)_n = \dfrac{e^{x_n}}{\sum_{i=1}^{N} e^{x_i}}$
• This maps every output into the range 0 to 1, with the values summing to 1.
• The result has the shape of a probability distribution, but the values shouldn't be treated as actual probabilities.
• It exaggerates differences: the higher values are boosted toward 1 while the lower values are squashed toward 0.
• We can then take the derivatives of these values to use them for gradient descent.
• It's common to train with SoftMax and then use ArgMax when running in production (see the sketch below).
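A sketch on the same hypothetical outputs; subtracting the max before exponentiating is a standard trick that avoids overflow without changing the result, since the shift cancels in the ratio:

```python
import numpy as np

def softmax(x):
    """e^{x_n} / sum_i e^{x_i}, shifted by max(x) for numerical stability."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

outputs = np.array([1.2, 4.8, 4.7])
p = softmax(outputs)
print(p.round(3))    # [0.014 0.518 0.468]  <- the near-tie is visible now
print(p.sum())       # 1.0 (up to float rounding)
print(np.argmax(p))  # 1  <- ArgMax at inference picks the same class
```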