Transition in the backward pass

Hi,

I don't fully understand a part of the backward pass justification we are making in the course (class 8c):

[Screenshot attached: the backward-pass derivation from lecture 8c]

I don't understand the last step...

Thanks in advance for your help!

Hello David,

The last equation comes from the chain rule for multivariable functions. I will give it to you for a function of three variables, but you can easily generalize it to any number of variables. For \(f(x_1,x_2,x_3)\) a function from \( \mathbb R^3\) to \( \mathbb R\) and \(g_1(y),g_2(y),g_3(y)\) three functions from \( \mathbb R\) to \( \mathbb R\), we have, for \(h(y)= f(g_1(y),g_2(y),g_3(y))\):

$$ \frac{\partial h}{\partial y} = \frac{\partial f}{\partial x_1} \frac{\partial g_1}{\partial y} + \frac{\partial f}{\partial x_2} \frac{\partial g_2}{\partial y} + \frac{\partial f}{\partial x_3} \frac{\partial g_3}{\partial y} $$

This is exactly what we are using here with the overloaded notation that \(z_k^{(l+1)}\) is both the variable and a function of \(z_j^{(l)}\).
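
If a concrete check helps, here is a quick numerical sanity check of this three-variable chain rule in Python (my own illustration, not from the course material), with arbitrary choices for \(f\) and the \(g_i\):

```python
import math

# Arbitrary illustrative choices (not from the course):
# f(x1, x2, x3) = x1*x2 + sin(x3), g1(y) = y^2, g2(y) = e^y, g3(y) = 3y.
def f(x1, x2, x3):
    return x1 * x2 + math.sin(x3)

def g1(y): return y ** 2
def g2(y): return math.exp(y)
def g3(y): return 3.0 * y

def h(y):
    return f(g1(y), g2(y), g3(y))

y, eps = 0.7, 1e-6

# Left-hand side: numerical derivative of h at y.
lhs = (h(y + eps) - h(y - eps)) / (2 * eps)

# Right-hand side: chain rule, with the partials of f evaluated at (g1(y), g2(y), g3(y)).
rhs = (g2(y) * 2 * y             # ∂f/∂x1 = x2,      dg1/dy = 2y
       + g1(y) * math.exp(y)     # ∂f/∂x2 = x1,      dg2/dy = e^y
       + math.cos(g3(y)) * 3.0)  # ∂f/∂x3 = cos(x3), dg3/dy = 3

print(lhs, rhs)  # the two values should agree to roughly 6 decimal places
```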

Best,
Nicolas

That's exactly what I was missing, thank you so much!

This is an application of the chain rule to the gradient of the loss.

During the forward pass, \(z_{j}^{(l)}\) depends on the previous activations (indexed by \(i\), connected through weights \(w_{i,j}^{(l)}\)). In turn those activations depend on their previous activations and so on:

$$z_{j}^{(l)}=\sum_{i} w_{i, j}^{(l)} \phi\left(z_{i}^{(l-1)}\right)+b_{j}^{(l)}$$

During the backward pass, the gradient of the loss w.r.t. a given \(z_{j}^{(l)}\) depends on the pre-activations of the next layer that it feeds into (indexed by \(k\), connected through the weights \(w_{j, k}^{(l+1)}\)):

$$\frac{\partial \mathcal{L}_{n}}{\partial z_{j}^{(l)}}=\sum_{k} \frac{\partial \mathcal{L}_{n}}{\partial z_{k}^{(l+1)}} \frac{\partial z_{k}^{(l+1)}}{\partial z_{j}^{(l)}}$$
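
To spell out the last factor (a small completing step, using the forward-pass formula above with \(l\) shifted to \(l+1\)):

$$z_{k}^{(l+1)}=\sum_{j} w_{j, k}^{(l+1)} \phi\left(z_{j}^{(l)}\right)+b_{k}^{(l+1)} \quad\Longrightarrow\quad \frac{\partial z_{k}^{(l+1)}}{\partial z_{j}^{(l)}}=w_{j, k}^{(l+1)}\, \phi'\left(z_{j}^{(l)}\right)$$

so each term of the sum is just the upstream gradient scaled by the corresponding weight and the local derivative of the activation.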

See also Arnout's answer, which is even more detailed :)
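
For a quick numerical sanity check of the backward-pass formula, here is a small NumPy sketch (my own, not from the lecture notes), assuming a tanh activation and a simple quadratic loss on the next layer's pre-activations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Activation φ and its derivative; tanh is an arbitrary smooth choice.
phi = np.tanh
def dphi(z):
    return 1.0 - np.tanh(z) ** 2

# Pre-activations z_j^{(l)} of layer l (treated as free variables here),
# and the weights/bias mapping layer l to layer l+1:
#   z_k^{(l+1)} = sum_j w_{j,k}^{(l+1)} φ(z_j^{(l)}) + b_k^{(l+1)}
z_l = rng.normal(size=3)          # z_j^{(l)}
W = rng.normal(size=(3, 2))       # w_{j,k}^{(l+1)}
b = rng.normal(size=2)            # b_k^{(l+1)}

def z_next(z_l):
    return phi(z_l) @ W + b       # z_k^{(l+1)}

def loss(z_l):
    # A simple quadratic "loss" on the next layer, just for the check.
    return 0.5 * np.sum(z_next(z_l) ** 2)

# Backward-pass formula:
#   dL/dz_j^{(l)} = sum_k dL/dz_k^{(l+1)} * w_{j,k}^{(l+1)} * φ'(z_j^{(l)})
dL_dz_next = z_next(z_l)                    # dL/dz_k^{(l+1)} for the quadratic loss
grad_formula = (W @ dL_dz_next) * dphi(z_l)

# Central finite differences for comparison.
eps = 1e-6
grad_numeric = np.array([
    (loss(z_l + eps * e) - loss(z_l - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.max(np.abs(grad_formula - grad_numeric)))  # should be tiny (~1e-9)
```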

Hello,

I have a follow-up question: isn't \(z_j^{(l)}\) the result of the linear transform before applying the activation function? Moreover, shouldn't we use \(x_i^{(l-1)}\) in the sum, as it is the output of the previous layer that is multiplied by the weights?

Correct. Thanks for noticing these typos; I have updated the comment with the correct symbols in accordance with the lecture notes.
