
Vanishing gradient

Hi,

I didn't quite understand the problem of having an activation function whose gradient vanishes for very large values of |x|. How exactly does it affect the NN in a bad way? Does it slow the rate of convergence of the SGD algorithm?

Thanks,

Loïc Busson

Hi,
The function expressed by your NN with L layers is a composition of L activation functions (one per layer), so a product of L activation-function derivatives appears in your gradient. If these derivatives are too small, the gradient decreases exponentially with the number of layers and becomes arbitrarily small, so your algorithm gets stuck and stops making progress (even if you are not at a local minimum).
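A minimal numerical sketch of this chain-rule product, assuming a hypothetical chain of single-unit sigmoid layers with unit weights (not part of the original post): each layer contributes a factor sigmoid'(z) ≤ 0.25, so the gradient with respect to the input shrinks exponentially as the depth L grows.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Toy chain of L layers, each computing a = sigmoid(w * a) with w = 1.
# By the chain rule, d(output)/d(input) is the product of one derivative per layer.
x = 2.0  # moderately large |x|, where sigmoid'(x) is already small
w = 1.0

for L in [1, 5, 10, 20, 40]:
    a = x
    grad = 1.0
    for _ in range(L):
        z = w * a
        grad *= w * sigmoid_prime(z)  # multiply in this layer's derivative
        a = sigmoid(z)
    print(f"L = {L:2d}  ->  gradient magnitude ~ {grad:.3e}")
```

Running it shows the gradient magnitude collapsing toward zero as L increases, which is why the SGD updates for the early layers become negligible.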

Best,
Nicolas

I understand, thanks for the clarification!

Best,
Loïc

