Vanishing Gradient Problem

  • When training a neural network, if the gradients of the error function with respect to the weights become very small, those weights effectively stop updating or make extremely slow progress.
  • As a result, training converges slowly or not at all.
  • This is commonly seen with sigmoid activation functions, among others, because they saturate at the extremes, where their derivative approaches zero (see the first sketch below).
  • Non-saturating activation functions such as ReLU help mitigate the problem (see the second sketch below), and faster hardware has helped by making the remaining slow progress less costly.
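
The saturation effect is easy to see numerically. Below is a minimal sketch (plain Python, no framework, illustrative only) of the sigmoid derivative and of how the chained derivative factors shrink as depth grows:

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def sigmoid_deriv(x):
        # Derivative of the sigmoid: s * (1 - s).
        # It peaks at 0.25 when x == 0 and approaches 0 at the extremes (saturation).
        s = sigmoid(x)
        return s * (1.0 - s)

    # The derivative is already tiny for moderately large inputs.
    for x in (0.0, 2.0, 5.0, 10.0):
        print(f"sigmoid'({x:>4}) = {sigmoid_deriv(x):.6f}")

    # By the chain rule, a gradient flowing back through n sigmoid layers picks up
    # one activation-derivative factor per layer, so that part of the gradient is
    # bounded by 0.25**n even in the best case.
    for n in (2, 5, 10, 20):
        print(f"{n:>2} layers: activation-derivative bound = {0.25 ** n:.2e}")

With 20 layers that bound is already around 1e-12, which is why the early layers of a deep sigmoid network learn so slowly when the units saturate.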
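
For contrast, a sketch (same assumptions) of the ReLU derivative: it is exactly 1 for any positive input, so the per-layer factor does not shrink for active units:

    def relu_deriv(x):
        # Derivative of ReLU: 1 for positive inputs, 0 otherwise
        # (undefined at exactly 0; 0 is used here by convention).
        return 1.0 if x > 0 else 0.0

    for x in (0.5, 2.0, 5.0, 10.0):
        print(f"relu'({x:>4}) = {relu_deriv(x):.1f}")  # stays 1.0, no saturation

    # Product of 20 per-layer activation-derivative factors:
    print("sigmoid, 20 layers:", 0.25 ** 20)  # ~9e-13, effectively vanished
    print("relu,    20 layers:", 1.0 ** 20)   # 1.0, magnitude preserved on active paths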