Vanishing gradient
Hi,
I didn't quite understand the problem of having an activation function with a vanishing gradient for very large values of |x|. How exactly does it affect the NN in a bad way? Does it slow the rate of convergence of the SGD algorithm?
Thanks,
Loïc Busson
Hi,
The function expressed by your NN with L layers is a composition of L activation functions (one per layer), so by the chain rule a product of L activation-function derivatives appears in your gradient. If these derivatives are too small (as happens when the activation saturates for large |x|), the gradient decreases exponentially with the number of layers and becomes arbitrarily small, so your algorithm gets stuck and stops moving (even if you aren't at a local minimum).
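To make this concrete, here is a minimal sketch (not from the original answer) using the sigmoid, whose derivative is at most 0.25. Even in the best case, the product of derivatives that the chain rule puts in the gradient shrinks exponentially with depth, and saturation at large |x| makes each factor far smaller still:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximal at x = 0, where it equals 0.25

# Best case: every pre-activation is 0, so each factor in the
# chain-rule product is 0.25. The product still vanishes with depth.
for depth in (1, 5, 10, 20):
    grad_factor = sigmoid_deriv(0.0) ** depth
    print(f"depth {depth:2d}: gradient factor ~ {grad_factor:.2e}")

# Saturation: for large |x| each factor is tiny on its own,
# so the product vanishes even faster.
print(f"sigmoid'(10) = {sigmoid_deriv(10.0):.2e}")
```

At depth 20 the best-case factor is already below 1e-12, which is why deep sigmoid networks barely update their early layers under SGD.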
Best,
Nicolas
I understand, thanks for the clarification!
Best,
Loïc