Lab 8 theory question
Hi,
Can you please give me more details about the last part of the solution?
" It remains to bound the inner derivative for each such function. Note that by assumption each weight has magnitude at most 1 and we assumed that we have K = 3, i.e., we have only three nodes per layer. Therefore, we get at most a factor 3 from the inner derivative. This proves the claim."
I guess that the inner derivative is the derivative of \(W^{(l)} x^{(l-1)} + b^{(l)}\) with respect to \(W^{(l)}_{1,1}\).
Isn't it equal to \(x^{(l-1)}_1\)? Where does the weight appear?
Best regards,
Ali
By "inner derivatives", the solution means quantities of the form
d(output of layer l) / d(input of layer l).
These are derivatives with respect to a layer's inputs, not its weights.
They appear because, by the chain rule,
d(loss) / d(weights of layer 1) = d(output of layer 1) / d(weights of layer 1) * d(loss) / d(output of layer 1),
and the second factor is itself a product of terms of the form
d(output of layer l) / d(input of layer l).
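As a concrete sanity check (my own sketch, not part of the official solution), here is one such inner derivative for a single layer with K = 3 nodes and weights of magnitude at most 1. I assume the activation is the sigmoid, which is what the 1/4 in the \((K/4)^L\) rate suggests; then d(output of layer) / d(input of layer) is the matrix diag(sigmoid'(z)) W, and every entry is bounded by 1/4:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3  # three nodes per layer, as in the exercise

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One layer: output = sigmoid(W @ x + b), with |W_ij| <= 1 by assumption.
W = rng.uniform(-1.0, 1.0, size=(K, K))
b = rng.uniform(-1.0, 1.0, size=K)
x = rng.standard_normal(K)

s = sigmoid(W @ x + b)
# Inner derivative d(output of layer) / d(input of layer):
# by the chain rule this is diag(sigmoid'(z)) @ W, a K x K matrix.
J = np.diag(s * (1.0 - s)) @ W

# sigmoid'(z) <= 1/4 and |W_ij| <= 1, so every entry of J has
# magnitude at most 1/4; a row of J sums K such entries, which is
# where the factor K from the solution comes from.
print(np.abs(J).max())
```

So the weight does appear, but in the Jacobian with respect to the layer's input, not in the derivative with respect to \(W^{(l)}_{1,1}\) (which is indeed \(x^{(l-1)}_1\) times the sigmoid derivative).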
Thank you for your answer, I think I understand the point now.
I did the computation on my side and could not reproduce the result.
Also, if we claim that the gradient vanishes at a rate of \((K/4)^L\), then why couldn't we set K = 4?
We know that \(d(loss) / d(output_1)\) is a product of terms \(d(output_l) / d(input_l)\), thanks to the chain rule.
However, the input is supposed to be a vector in \(\mathbb{R}^3\), so I would guess that each of the \(d(output_l) / d(input_l)\) terms is a matrix, except for the very last layer.
Best regards,
Ali
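For what it's worth, the \((K/4)^L\) rate can be checked numerically. The following is a sketch under the same assumptions as above (sigmoid activations, K = 3 nodes per layer, random weights of magnitude at most 1): each layer's Jacobian is indeed a K x K matrix, multiplying two K x K matrices with entries bounded by c gives entries bounded by K c², and chaining L of them gives entries bounded by \(K^{L-1}(1/4)^L\). Note this is only an upper bound, and it stops decaying exactly when K reaches 4:

```python
import numpy as np

rng = np.random.default_rng(1)
K, L = 3, 30  # three nodes per layer, thirty layers

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.standard_normal(K)
prod = np.eye(K)  # running product of layer Jacobians
for _ in range(L):
    W = rng.uniform(-1.0, 1.0, size=(K, K))  # |W_ij| <= 1
    b = rng.uniform(-1.0, 1.0, size=K)
    s = sigmoid(W @ x + b)
    J = np.diag(s * (1.0 - s)) @ W  # d(output_l) / d(input_l), a K x K matrix
    prod = J @ prod
    x = s

# Every entry of each J has magnitude at most 1/4, so entries of the
# product are bounded by K^(L-1) * (1/4)^L = (K/4)^L / K, which only
# shrinks with L when K < 4.
bound = (K / 4.0) ** L / K
print(np.abs(prod).max(), "<=", bound)
```

This doesn't settle whether the actual gradient vanishes at K = 4, only that the bound derived in the solution no longer does.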