
Vanishing gradient

Hi,

I didn't quite understand the problem of having an activation function whose gradient vanishes for very large values of |x|. How exactly does it affect the NN in a bad way? Does it slow the rate of convergence of the SGD algorithm?

Thanks,

Loïc Busson

Hi,
The function expressed by your NN with L layers is a composition of L activation functions (one per layer), so a product of L activation-function derivatives appears in your gradient. If these derivatives are too small, the gradient decreases exponentially with the number of layers and becomes arbitrarily small, so your algorithm gets stuck and stops making progress (even if you are not at a local minimum).
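A minimal numerical sketch of this chain-rule product, assuming a hypothetical chain of single-unit sigmoid layers with unit weights (not part of the original post): each layer contributes a factor sigmoid'(z) ≤ 0.25, so the gradient with respect to the input shrinks exponentially as the depth L grows.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Toy chain of L layers, each computing a = sigmoid(w * a) with w = 1.
# By the chain rule, d(output)/d(input) is the product of one derivative per layer.
x = 2.0  # moderately large |x|, where sigmoid'(x) is already small
w = 1.0

for L in [1, 5, 10, 20, 40]:
    a = x
    grad = 1.0
    for _ in range(L):
        z = w * a
        grad *= w * sigmoid_prime(z)  # multiply in this layer's derivative
        a = sigmoid(z)
    print(f"L = {L:2d}  ->  gradient magnitude ~ {grad:.3e}")
```

Running it shows the gradient magnitude collapsing toward zero as L increases, which is why the SGD updates for the early layers become negligible.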

Best,
Nicolas

I understand, thanks for the clarification!

Best,
Loïc

