Bad initialization of a neural network's weights makes it useless


I hope my question does not bother you, since the class is over. I tried to find a "formal" answer to my question online, but only found handwavy/intuitive answers that do not help me understand what is going on.

I was playing with neural networks lately when I read online that initializing all weights to the same value makes a network unable to learn correctly. To check this, I implemented a basic network in PyTorch and used the Adam optimizer with negative log-likelihood loss (for a classification task on MNIST). Indeed, the network was useless and did not learn anything. I used the ReLU activation; I know one has to be careful there, since its gradient is zero for negative inputs. To avoid that case, I initialized all weights to 1, and since MNIST pixels are encoded in [0, 1], everything seemed fine.
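For concreteness, a minimal sketch of that setup (the hidden width, batch of random data standing in for MNIST, and number of steps are all made up here): with every weight set to 1 and biases at 0, all hidden units compute the same function and receive the same gradient, so they remain identical under Adam.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Minimal stand-in for the setup above: all weights set to 1, biases to 0.
# Random data in [0, 1] replaces MNIST; the hidden width is arbitrary.
model = nn.Sequential(
    nn.Linear(784, 32),
    nn.ReLU(),
    nn.Linear(32, 10),
    nn.LogSoftmax(dim=1),
)
for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.ones_(layer.weight)
        nn.init.zeros_(layer.bias)

x = torch.rand(64, 784)           # pixel-like inputs in [0, 1]
y = torch.randint(0, 10, (64,))   # fake labels

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.NLLLoss()
for _ in range(10):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

# Every hidden unit saw the same forward value and the same backward
# signal, so the rows of the first layer are still identical to each other.
w = model[0].weight.detach()
print(torch.allclose(w[0], w[1]))
```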

Then I tried to understand why it did not work by starting from a simpler case: a linear regression (which is basically a neural net with one layer and the identity activation), with the weights initialized to zero and the MSE loss. This time I did not use an optimizer but updated the weights with SGD "by hand". As one can expect from writing out the derivative of the MSE, the gradient is non-zero in general, and indeed the "network" is able to learn the ground-truth weights used to generate the data.
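That experiment might look like the following sketch (the ground-truth weights, learning rate, and data sizes are made up): a single linear layer initialized at zero, trained with hand-written SGD on the MSE, recovers the generating weights.

```python
import torch

torch.manual_seed(0)

# Linear regression with zero-initialized weights and hand-written SGD on
# the MSE loss; w_true is an arbitrary made-up ground truth.
n, d = 200, 5
X = torch.randn(n, d)
w_true = torch.tensor([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ w_true

w = torch.zeros(d, requires_grad=True)
lr = 0.1
for _ in range(300):
    loss = ((X @ w - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad   # plain SGD step, no optimizer object
        w.grad.zero_()

# At w = 0 the MSE gradient is -(2/n) X^T y, non-zero in general, so the
# zero initialization poses no problem for a single linear layer.
print(torch.allclose(w.detach(), w_true, atol=1e-3))
```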

Now I don't know how to find where the flaw comes from. Is it the Adam optimizer that struggles?

Thanks for your help

This is a good question, and indeed initialization has a huge effect in deep learning in practice, as opposed to convex models (such as logistic regression), where we know the optimizer will provably converge no matter which initialization it starts from. The main reason is that deep learning objectives are non-convex.

For deep nets, many papers recommend particular initializations (for example, for CNNs, Xavier initialization is very common). Random initialization is already much better than a constant one, but not as good as more tailored schemes.
If you use a different initialization, you also have to re-tune the stepsize of your optimizer (whether Adam or SGD), as it might otherwise cause exploding gradients or slow convergence.
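For reference, PyTorch ships these schemes as ready-made initializers; a quick sketch with an arbitrary layer shape:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(784, 2000)              # arbitrary example shape
nn.init.xavier_uniform_(layer.weight)     # Xavier/Glorot initialization
nn.init.zeros_(layer.bias)
# For ReLU nets, nn.init.kaiming_normal_ is the usual alternative.

# Xavier-uniform samples from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)),
# chosen to keep the activation variance roughly constant across layers.
bound = (6 / (784 + 2000)) ** 0.5
print(layer.weight.abs().max().item() <= bound)
```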

BTW, even from bad initializations the optimizers will still work in the sense that they slowly move toward a flat region; it just might take unreasonably long, or they could potentially end up in a bad local minimum (you probably observed the first case).

Thanks for your answer.

The main thing I find disturbing is that all the weights (not the biases) stay the same... They keep their initial value of one, and only the biases of the last layer change (the ones from the first layer are all equal but non-zero)...

The architecture is a linear layer (784 -> 2000), followed by a leaky ReLU, followed by a linear layer (2000 -> 10), finishing with a log-softmax. I tried a few architectures and the problem stayed the same. If I use random values instead of a constant initialization, the problem completely disappears.

That's why I wondered whether you knew of a reason for this; it seems more than a coincidence that the weights stay at 1... One observation about the forward pass: if all weights are equal to one, every "output" of the first layer is the same (the sum of the entries of the input vector), hence the second layer receives the same value in each neuron, irrespective of the input. However, I don't see why no update occurs during backpropagation. Checking against the expressions we derived in the lectures, it seems to me that the gradient should not be equal for each parameter...
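This can be probed numerically with a sketch of the architecture above (random data stands in for MNIST): when every column of the second-layer weight matrix is constant, the error signal reaching a hidden unit is the sum over classes of the softmax residuals, and that sum cancels, which would explain the frozen first-layer weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# The architecture described above, with all weights set to 1 and biases
# to 0; random data stands in for MNIST.
model = nn.Sequential(
    nn.Linear(784, 2000),
    nn.LeakyReLU(),
    nn.Linear(2000, 10),
    nn.LogSoftmax(dim=1),
)
with torch.no_grad():
    for layer in model:
        if isinstance(layer, nn.Linear):
            layer.weight.fill_(1.0)
            layer.bias.zero_()

x = torch.rand(8, 784)
y = torch.randint(0, 10, (8,))
nn.NLLLoss()(model(x), y).backward()

# The signal backpropagated into hidden unit j is
#   sum_k W2[k, j] * (p_k - 1[k = y]) = sum_k (p_k - 1[k = y]) = 0,
# since the columns of W2 are all equal and the softmax residuals sum to
# zero over the classes. So the first-layer weight gradient vanishes
# (up to floating-point noise), and those weights never move from 1.
g = model[0].weight.grad
print(g.abs().max().item() < 1e-5)
```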
