Activation Functions

  • An activation function is simply a function in a layer that acts on the input.
  • Assuming the input to the function is already scaled: x = mx' + b
  • Sigmoid
    • 1 / (1 + e^{-x})
    • This is the most commonly taught activation function, but it's not commonly seen in real systems, except sometimes at the output stage of a single-output network.
    • The derivative has a small range (it never exceeds 0.25), which reduces its usefulness, especially as more layers are added; see the first sketch after this list.
  • ReLU
    • Rectified Linear Unit
    • This has been the most-used function in deep learning; it and the related functions below are sketched in code after this list.
    • y = max(x, 0)
    • Mostly just a passthrough, but output is clamped to positive values.
    • Because the function is not smooth, we just fudge the derivative a little bit and say that f' at the point where x = 0 is just 0 or 1.
    • Advantages:
      • Similar to biological neuron in that it's fixed until some minimum activation.
      • Fast to compute
      • No problems with the Vanishing Gradient Problem, since it doesn't saturate for positive inputs.
    • Disadvantages:
      • We can end up with ReLUs that are always inactive. If we get too many of these it hurts the performance of the network.
        • This is called the “dying ReLU” problem.
  • Leaky ReLU
    • This addresses the dying ReLU problem by using a slightly negative slope for values below 0.
    • Output is y = x for x > 0 and y = ax for x < 0. Usually a = 0.01.
  • Softplus
    • y = log(1 + e^x)
    • This looks kind of like a ReLU but increases with a smooth curve instead of abruptly changing the slope at 0.
  • Gaussian Error Linear Unit (GeLU)
    • y = 0.5 * x * (1 + erf(x / \sqrt{2})), where erf is the Gaussian error function.
    • This is the activation function used by GPT and BART. It's also a way to solve the dying ReLU problem.
    • Unlike ReLU, and similar to the Leaky ReLU and SoftPlus, GeLU has a non-zero gradient for negative inputs. Unlike the Leaky ReLU, it is smooth around zero.
    • For large positive inputs, the 0.5 * (1 + erf(x / \sqrt{2})) factor converges to 1, so GeLU approaches the identity (y ≈ x), just like ReLU.
  • ArgMax
    • The raw output values in a network aren’t always between 0 and 1, especially with multiple outputs and more complex networks.
    • ArgMax is a way to simplify interpreting the results. It sets the largest value of all the outputs to 1 and the others to 0.
      • Big downside here is that we can't backpropagate a network with an ArgMax, since its gradient is zero almost everywhere.
      • Also, if the two highest outputs are pretty close, we don't see that.
  • SoftMax
    • Softmax attempts to solve the backpropagation problem faced by ArgMax.
    • softmax_n(outputs) = e^{x_n} / \sum_{i=1}^{N} e^{x_i}
    • This transforms all the output values into the range 0 to 1, with the values summing to 1.
      • This is kind of like a probability but shouldn’t actually be used as one.
      • It boosts the higher values and suppresses the lower ones.
    • We can then take the derivatives of these values to use them for gradient descent.
    • It's common to use ArgMax when running in production but SoftMax for training (see the last sketch after this list).
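
A minimal NumPy sketch of the sigmoid and its derivative (the function names here are my own, not from any particular library): the derivative peaks at 0.25 and collapses toward 0 once the input saturates, which is exactly the problem described above.

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + e^-x): squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); never larger than 0.25
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.round(sigmoid(x), 4))             # roughly [0.0067 0.2689 0.5 0.7311 0.9933]
print(np.round(sigmoid_derivative(x), 4))  # roughly [0.0066 0.1966 0.25 0.1966 0.0066]
```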
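
Here's a sketch of the ReLU-style functions, also in NumPy (names are mine, and I lean on scipy.special.erf for the Gaussian error function). It shows how Leaky ReLU, Softplus, and GeLU keep a nonzero gradient below zero while plain ReLU's is exactly 0.

```python
import numpy as np
from scipy.special import erf  # Gaussian error function used by GeLU

def relu(x):
    # Passthrough for positive inputs, clamped to 0 otherwise
    return np.maximum(x, 0.0)

def leaky_relu(x, a=0.01):
    # A small slope a below zero keeps some gradient flowing (no dying ReLUs)
    return np.where(x > 0, x, a * x)

def softplus(x):
    # log(1 + e^x): a smooth curve instead of a sharp corner at 0
    return np.log1p(np.exp(x))

def gelu(x):
    # 0.5 * x * (1 + erf(x / sqrt(2))): smooth near zero, approaches x for large x
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
for f in (relu, leaky_relu, softplus, gelu):
    print(f.__name__, np.round(f(x), 3))
```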
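
Finally, a small sketch contrasting ArgMax and SoftMax on the same raw outputs (again, function names are my own). ArgMax hides how close the top two scores were; SoftMax keeps the relative scores and stays differentiable.

```python
import numpy as np

def softmax(x):
    # Subtract the max first for numerical stability; the result sums to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

def argmax_one_hot(x):
    # Set the largest output to 1 and everything else to 0
    out = np.zeros_like(x)
    out[np.argmax(x)] = 1.0
    return out

raw_outputs = np.array([2.0, 1.9, -1.0])
print(argmax_one_hot(raw_outputs))        # [1. 0. 0.]  (the near-tie is invisible)
print(np.round(softmax(raw_outputs), 2))  # roughly [0.51 0.46 0.03]
```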

Thanks for reading! If you have any questions or comments, please send me a note on Twitter.