Transition in the backward pass
Hi,
I don't fully understand part of the backward-pass justification we make in the course (class 8c):
I don't understand the last step...
Thanks in advance for your help!
Hello David,
The last equation comes from the chain rule for multivariable functions. I will state it for a function of three variables, but you can easily generalize it to any number of variables. For \(f(x_1,x_2,x_3)\) a function from \( \mathbb R^3\) to \( \mathbb R\) and \(g_1(y),g_2(y),g_3(y)\) three functions from \( \mathbb R\) to \( \mathbb R\), we have for \(h(y)= f(g_1(y),g_2(y),g_3(y))\):
$$ \frac{\mathrm{d} h}{\mathrm{d} y} = \frac{\partial f}{\partial x_1} \frac{\mathrm{d} g_1}{\mathrm{d} y} + \frac{\partial f}{\partial x_2} \frac{\mathrm{d} g_2}{\mathrm{d} y} + \frac{\partial f}{\partial x_3} \frac{\mathrm{d} g_3}{\mathrm{d} y} $$
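To make the rule concrete, here is a small numerical check (my own toy choice of \(f\) and the \(g_i\), not from the lecture): the hand-computed chain-rule derivative matches a central finite difference of \(h\).

```python
import math

# Toy instance of the multivariable chain rule (hypothetical example):
# h(y) = f(g1(y), g2(y), g3(y)) with f(x1, x2, x3) = x1 * x2 + x3
# and g1(y) = y^2, g2(y) = sin(y), g3(y) = exp(y).

def h(y):
    return (y ** 2) * math.sin(y) + math.exp(y)

def h_prime(y):
    # dh/dy = df/dx1 * dg1/dy + df/dx2 * dg2/dy + df/dx3 * dg3/dy
    df_dx1 = math.sin(y)   # df/dx1 = x2 = sin(y)
    df_dx2 = y ** 2        # df/dx2 = x1 = y^2
    df_dx3 = 1.0           # df/dx3 = 1
    dg1 = 2 * y
    dg2 = math.cos(y)
    dg3 = math.exp(y)
    return df_dx1 * dg1 + df_dx2 * dg2 + df_dx3 * dg3

# Compare against a central finite difference of h.
y0, eps = 0.7, 1e-6
numeric = (h(y0 + eps) - h(y0 - eps)) / (2 * eps)
print(abs(h_prime(y0) - numeric))  # tiny: the two derivatives agree
```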
This is exactly what we are using here with the overloaded notation that \(z_k^{(l+1)}\) is both the variable and a function of \(z_j^{(l)}\).
Best,
Nicolas
That's exactly what I was missing, thank you so much!
This is an application of the chain rule of the gradient.
During the forward pass, \(z_{j}^{(l)}\) depends on the previous activations (indexed by \(i\), connected through weights \(w_{i,j}^{(l)}\)). In turn, those activations depend on their previous activations, and so on:
$$z_{j}^{(l)}=\sum_{i} w_{i, j}^{(l)} \phi\left(z_{i}^{(l-1)}\right)+b_{j}^{(l)}$$
During the backward pass, the gradient of the loss w.r.t. a given \(z_{j}^{(l)}\) depends on the gradients w.r.t. the next layer's units (indexed by \(k\), connected through the weights \(w_{j, k}^{(l+1)}\)):
$$\frac{\partial \mathcal{L}_{n}}{\partial z_{j}^{(l)}}=\sum_{k} \frac{\partial \mathcal{L}_{n}}{\partial z_{k}^{(l+1)}} \frac{\partial z_{k}^{(l+1)}}{\partial z_{j}^{(l)}}$$
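If it helps, here is a minimal NumPy sketch of both formulas for a toy network. The sizes, tanh activation, and squared-error loss are my own assumptions, not necessarily the course's setup; the backward loop implements exactly the sum over \(k\) above, and one weight gradient is checked against a finite difference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny MLP: layer sizes, tanh activation, squared-error loss.
sizes = [3, 4, 2]
W = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]  # W[l][i, j] = w_{i,j}^{(l+1)}
b = [rng.normal(size=n) for n in sizes[1:]]
phi = np.tanh
dphi = lambda z: 1 - np.tanh(z) ** 2

x = rng.normal(size=sizes[0])
target = rng.normal(size=sizes[-1])

# Forward pass: z_j^{(l)} = sum_i w_{i,j}^{(l)} phi(z_i^{(l-1)}) + b_j^{(l)}
zs, a = [], x
for Wl, bl in zip(W, b):
    z = a @ Wl + bl
    zs.append(z)
    a = phi(z)
loss = 0.5 * np.sum((a - target) ** 2)

# Backward pass: dL/dz_j^{(l)} = sum_k dL/dz_k^{(l+1)} * dz_k^{(l+1)}/dz_j^{(l)}
# with dz_k^{(l+1)}/dz_j^{(l)} = w_{j,k}^{(l+1)} * phi'(z_j^{(l)}).
delta = (phi(zs[-1]) - target) * dphi(zs[-1])      # output layer
deltas = [delta]
for l in range(len(W) - 2, -1, -1):
    delta = (W[l + 1] @ delta) * dphi(zs[l])       # the recursion above
    deltas.insert(0, delta)

# Sanity check: dL/dw_{0,1}^{(1)} = x_0 * dL/dz_1^{(1)}, vs. finite difference.
def loss_of(Wmod):
    a = x
    for Wl, bl in zip(Wmod, b):
        a = phi(a @ Wl + bl)
    return 0.5 * np.sum((a - target) ** 2)

grad_w = x[0] * deltas[0][1]
eps = 1e-6
Wp = [w.copy() for w in W]; Wp[0][0, 1] += eps
Wm = [w.copy() for w in W]; Wm[0][0, 1] -= eps
numeric = (loss_of(Wp) - loss_of(Wm)) / (2 * eps)
print(abs(grad_w - numeric))  # tiny: backprop matches the finite difference
```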
See also Arnout's answer, which is even more detailed :)
Hello,
I have a follow-up question: isn't \(z_j^{(l)}\) the result of the linear transform before applying the activation function? Moreover, shouldn't we use \(x_i^{(l-1)}\) in the sum, as it is the output of the previous layer that is multiplied by the weights?
Correct. Thanks for noticing these typos; I have updated the comment with the correct symbols in accordance with the lecture notes.