
Lab 8 theory question

Hi,

Can you please give me more details about the last part of the solution?

" It remains to bound the inner derivative for each such function. Note that by assumption each weight has magnitude at most 1 and we assumed that we have K = 3, i.e., we have only three nodes per layer. Therefore, we get at most a factor 3 from the inner derivative. This proves the claim."

I guess that the inner derivative is the derivative of \(W^{(l)} x^{(l-1)} + b^{(l)}\) with respect to \(W^{(l)}_{1,1}\).
Isn't it equal to \(x^{(l-1)}_1\)? Where does the weight appear?

Best regards,

Ali

By "inner derivatives", the solution means quantities like:

d(output of layer) / d(input of layer)

These are derivatives with respect to the layer's inputs, rather than with respect to the layer's weights.

This is because

d(loss) / d(weights of layer 1) = d(output of layer 1) / d(weights of layer 1) * d(loss) / d(output of layer 1).

The second term here is itself a product of terms of the form d(output of layer) / d(input of layer), one per subsequent layer.
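This factorisation can be checked numerically. Below is a minimal sketch, assuming a made-up 2-layer sigmoid network with K = 3 nodes per layer and loss = sum of outputs (the names and the loss are illustrative, not the lab's exact setup): the gradient w.r.t. an entry of the first layer's weights is built from the layer-2 input Jacobian, and compared against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
K = 3
x0 = rng.uniform(-1, 1, K)
W1 = rng.uniform(-1, 1, (K, K))   # |weights| <= 1, as in the assumption
W2 = rng.uniform(-1, 1, (K, K))
b1, b2 = np.zeros(K), np.zeros(K)

x1 = sigmoid(W1 @ x0 + b1)
x2 = sigmoid(W2 @ x1 + b2)

# Jacobian of layer 2's output w.r.t. its *input* x1 -- the "inner derivative":
J2 = np.diag(x2 * (1 - x2)) @ W2          # shape (K, K)

# Chain rule: d(loss)/d(x1) = d(loss)/d(x2) @ J2, with loss = sum(x2)
dloss_dx1 = np.ones(K) @ J2

# d(loss)/d(W1): pass through sigmoid'(z1), then outer product with the input x0
dloss_dz1 = dloss_dx1 * x1 * (1 - x1)
grad_W1 = np.outer(dloss_dz1, x0)

# Finite-difference check of one entry, d(loss)/d(W1[0,0]):
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
fd = (sigmoid(W2 @ sigmoid(W1p @ x0 + b1) + b2).sum() - x2.sum()) / eps
print(abs(grad_W1[0, 0] - fd))    # analytic and numeric gradients agree
```

The point of the sketch is the middle step: the weight gradient contains `J2`, the derivative of a layer's output with respect to its input, and with more layers there would be one such Jacobian factor per subsequent layer.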

Thank you for your answer, I think I understood the point.

I did the computation on my side and could not reproduce the result.
Also, if we claim that the gradient vanishes with a rate of \((K/4)^L\), then why couldn't we set \(K = 4\)?

We know that \(d(loss) / d(output_1)\) is a product of \(d(output_l) / d(input_l)\) terms thanks to the chain rule.

However, the input is supposed to be a vector in \(\mathbb{R}^3\), so I would guess that each of the \(d(output_l) / d(input_l)\) terms is a matrix, except for the very last one.

Best regards,

Ali
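The shape bookkeeping above can be sketched as follows. This is a minimal illustration with made-up sigmoid layers, K = 3 nodes per layer, L = 4 layers, and |weights| <= 1 (none of it is the lab's exact code): each per-layer Jacobian is a 3x3 matrix, the derivative of the scalar loss w.r.t. the last output is a 1x3 row vector, and the product stays a 1x3 row vector whose entries obey the \((K/4)^L\) bound, since \(|\sigma'| \le 1/4\) and each matrix multiply picks up at most a factor \(K\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
K, L = 3, 4
x = rng.uniform(-1, 1, K)

jacobians = []
for _ in range(L):
    W = rng.uniform(-1, 1, (K, K))              # |weights| <= 1 by assumption
    x = sigmoid(W @ x)
    jacobians.append(np.diag(x * (1 - x)) @ W)  # d(output_l)/d(input_l): 3x3

row = np.ones((1, K))            # d(loss)/d(output_L) for loss = sum: 1x3
for J in reversed(jacobians):
    row = row @ J                # (1x3) @ (3x3) stays a 1x3 row vector

print(row.shape)
# Every Jacobian entry is <= 1/4 (|sigmoid'| <= 1/4, |W| <= 1), so each
# multiply scales the max entry by at most K/4 -- the (K/4)^L rate:
print(np.abs(row).max() <= (K / 4) ** L)
```

So yes: all the inner-derivative terms are matrices, and only the final loss derivative is a row vector; the product collapses to a row vector whose entries shrink at rate \((K/4)^L\).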
