In the vanishing gradient problem of problem set 8, I obtained the bound of 1/4 from the derivative of the sigmoid, but I can't get the multiplication by 3 in my own calculations, despite the explanations in the solutions. Can anyone help me, please? Thanks in advance!
Since you have three nodes per layer, you have to multiply the sigmoid's derivative (or here, its maximum, 1/4) by three when you apply the chain rule. Indeed, if you write out the three elements of W(l) and take the element-wise derivative layer by layer, you'll see that all three elements depend on W(1,1), so the chain rule sums three terms at each layer. That is where the factor of three in the solution comes from.
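For what it's worth, here is a sketch of that chain-rule step. The notation is mine, not necessarily the solution's: z_j^{(l)} is the pre-activation of node j in layer l, \delta_j^{(l)} is the derivative of the network output y with respect to it, and I'm assuming every weight satisfies |w| \le 1 (please check that against the problem statement).

```latex
% Assumed notation: z_k^{(l+1)} = \sum_{j=1}^{3} w_{jk}^{(l+1)} \sigma(z_j^{(l)}),
% and \delta_j^{(l)} = \partial y / \partial z_j^{(l)}.
\begin{align*}
\delta_j^{(l)}
  &= \sigma'\!\big(z_j^{(l)}\big) \sum_{k=1}^{3} w_{jk}^{(l+1)} \, \delta_k^{(l+1)}, \\
\big|\delta_j^{(l)}\big|
  &\le \underbrace{\tfrac{1}{4}}_{\max \sigma'} \cdot
       \underbrace{3}_{\text{terms in the sum}} \cdot
       \max_k \big|\delta_k^{(l+1)}\big|
   = \tfrac{3}{4} \max_k \big|\delta_k^{(l+1)}\big|
   \qquad \text{(assuming } |w_{jk}^{(l+1)}| \le 1\text{)}.
\end{align*}
```

So each layer contributes a factor of at most 3/4 to the bound: the 1/4 comes from the sigmoid and the 3 from the three terms in the sum.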
But then the vanishing gradient would disappear if we had 4 nodes?
Hmm, I didn't think of this. I suppose that with a large number of hidden nodes per layer, the bound no longer forces the gradient to vanish. However, with only 4 nodes as you proposed, the vanishing gradient could still very much be a problem: 1/4 is only the sigmoid's maximum derivative, and most of the time the derivative is smaller than that. This would be much harder to prove rigorously, which is presumably why they gave us 3 nodes per layer. In real life, though, even with a relatively high number of nodes (say 100), if your data gives you a small gradient per data point, you can still get a vanishing gradient as the number of layers increases, because the per-layer gradient may be so small that even (effectively) multiplying it by 100 does not make up for it. This is of course more difficult to understand or prove mathematically, but I hope the intuition makes sense.
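If you want to poke at this numerically, here is a minimal sketch. Everything in it is my own assumption rather than the problem set's setup: weights drawn uniformly from [-1, 1], no biases, an all-ones input, the output taken as the sum of the last layer's activations, and the helper name `first_layer_grad` is made up for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def first_layer_grad(width, depth, seed=0):
    """Largest |d output / d z_j| over first-layer pre-activations z_j.

    Hypothetical setup: random weights in [-1, 1], no biases,
    all-ones input, output = sum of last-layer activations.
    """
    rng = np.random.default_rng(seed)
    weights = [rng.uniform(-1.0, 1.0, size=(width, width)) for _ in range(depth)]

    # Forward pass, storing the pre-activations of every layer.
    a = np.ones(width)
    zs = []
    for W in weights:
        z = W @ a
        zs.append(z)
        a = sigmoid(z)

    # Backward pass: start from d output / d z at the last layer.
    delta = a * (1.0 - a)
    # Pair each earlier layer's pre-activations with the weights of the next layer.
    for W_next, z in zip(reversed(weights[1:]), reversed(zs[:-1])):
        s = sigmoid(z)
        delta = (W_next.T @ delta) * s * (1.0 - s)
    return float(np.max(np.abs(delta)))


# Compare the measured gradient with the (width / 4) per-layer bound.
for width in (3, 4, 100):
    for depth in (5, 10, 20):
        bound = 0.25 * (width / 4.0) ** (depth - 1)
        grad = first_layer_grad(width, depth)
        print(f"width={width:3d} depth={depth:2d} grad={grad:.3e} bound={bound:.3e}")
```

For width 3 the bound itself shrinks with depth; for width 4 and above it does not, so the printed `grad` column is what tells you whether the gradient still shrinks in this particular random setup. It is only one experiment, not a proof either way.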