Transition in the backward pass
Hi,
I don't fully understand part of the backward-pass justification we make in the course (class 8c):
I don't understand the last step...
Thanks in advance for your help!
Hello David,
The last equation comes from the chain rule for multivariable functions. I will state it for a function of three variables, but you can easily generalize it to any number of variables. For \(f(x_1,x_2,x_3)\) a function from \( \mathbb R^3\) to \( \mathbb R\) and \(g_1(y),g_2(y),g_3(y)\) three functions from \( \mathbb R\) to \( \mathbb R\), we have for \(h(y)= f(g_1(y),g_2(y),g_3(y))\):
$$ \frac{\mathrm{d} h}{\mathrm{d} y} = \frac{\partial f}{\partial x_1} \frac{\mathrm{d} g_1}{\mathrm{d} y} + \frac{\partial f}{\partial x_2} \frac{\mathrm{d} g_2}{\mathrm{d} y} + \frac{\partial f}{\partial x_3} \frac{\mathrm{d} g_3}{\mathrm{d} y} $$
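To make the rule concrete, here is a small numerical check (my own toy choice of \(f\) and the \(g_i\), not from the lecture): the hand-computed chain-rule derivative matches a central finite difference of \(h\).

```python
import math

# Toy instance of the multivariable chain rule (hypothetical example):
# h(y) = f(g1(y), g2(y), g3(y)) with f(x1, x2, x3) = x1 * x2 + x3
# and g1(y) = y^2, g2(y) = sin(y), g3(y) = exp(y).

def h(y):
    return (y ** 2) * math.sin(y) + math.exp(y)

def h_prime(y):
    # dh/dy = df/dx1 * dg1/dy + df/dx2 * dg2/dy + df/dx3 * dg3/dy
    df_dx1 = math.sin(y)   # df/dx1 = x2 = sin(y)
    df_dx2 = y ** 2        # df/dx2 = x1 = y^2
    df_dx3 = 1.0           # df/dx3 = 1
    dg1 = 2 * y
    dg2 = math.cos(y)
    dg3 = math.exp(y)
    return df_dx1 * dg1 + df_dx2 * dg2 + df_dx3 * dg3

# Compare against a central finite difference of h.
y0, eps = 0.7, 1e-6
numeric = (h(y0 + eps) - h(y0 - eps)) / (2 * eps)
print(abs(h_prime(y0) - numeric))  # tiny: the two derivatives agree
```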
This is exactly what we are using here with the overloaded notation that \(z_k^{(l+1)}\) is both the variable and a function of \(z_j^{(l)}\).
Best,
Nicolas
That's exactly what I was missing, thank you so much!
This is an application of the chain rule of the gradient.
During the forward pass, \(z_{j}^{(l)}\) depends on the previous activations (indexed by \(i\), connected through weights \(w_{i,j}^{(l)}\)). In turn, those activations depend on their previous activations, and so on:
$$z_{j}^{(l)}=\sum_{i} w_{i, j}^{(l)} \phi\left(z_{i}^{(l-1)}\right)+b_{j}^{(l)}$$
During the backward pass, the gradient of the loss w.r.t. a given \(z_{j}^{(l)}\) depends on the gradients w.r.t. the next layer's units (indexed by \(k\), connected through the weights \(w_{j, k}^{(l+1)}\)):
$$\frac{\partial \mathcal{L}_{n}}{\partial z_{j}^{(l)}}=\sum_{k} \frac{\partial \mathcal{L}_{n}}{\partial z_{k}^{(l+1)}} \frac{\partial z_{k}^{(l+1)}}{\partial z_{j}^{(l)}}$$
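If it helps, here is a minimal NumPy sketch of both formulas for a toy network. The sizes, tanh activation, and squared-error loss are my own assumptions, not necessarily the course's setup; the backward loop implements exactly the sum over \(k\) above, and one weight gradient is checked against a finite difference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny MLP: layer sizes, tanh activation, squared-error loss.
sizes = [3, 4, 2]
W = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]  # W[l][i, j] = w_{i,j}^{(l+1)}
b = [rng.normal(size=n) for n in sizes[1:]]
phi = np.tanh
dphi = lambda z: 1 - np.tanh(z) ** 2

x = rng.normal(size=sizes[0])
target = rng.normal(size=sizes[-1])

# Forward pass: z_j^{(l)} = sum_i w_{i,j}^{(l)} phi(z_i^{(l-1)}) + b_j^{(l)}
zs, a = [], x
for Wl, bl in zip(W, b):
    z = a @ Wl + bl
    zs.append(z)
    a = phi(z)
loss = 0.5 * np.sum((a - target) ** 2)

# Backward pass: dL/dz_j^{(l)} = sum_k dL/dz_k^{(l+1)} * dz_k^{(l+1)}/dz_j^{(l)}
# with dz_k^{(l+1)}/dz_j^{(l)} = w_{j,k}^{(l+1)} * phi'(z_j^{(l)}).
delta = (phi(zs[-1]) - target) * dphi(zs[-1])      # output layer
deltas = [delta]
for l in range(len(W) - 2, -1, -1):
    delta = (W[l + 1] @ delta) * dphi(zs[l])       # the recursion above
    deltas.insert(0, delta)

# Sanity check: dL/dw_{0,1}^{(1)} = x_0 * dL/dz_1^{(1)}, vs. finite difference.
def loss_of(Wmod):
    a = x
    for Wl, bl in zip(Wmod, b):
        a = phi(a @ Wl + bl)
    return 0.5 * np.sum((a - target) ** 2)

grad_w = x[0] * deltas[0][1]
eps = 1e-6
Wp = [w.copy() for w in W]; Wp[0][0, 1] += eps
Wm = [w.copy() for w in W]; Wm[0][0, 1] -= eps
numeric = (loss_of(Wp) - loss_of(Wm)) / (2 * eps)
print(abs(grad_w - numeric))  # tiny: backprop matches the finite difference
```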
See also Arnout's answer, which is even more detailed :)
Hello,
I have a follow-up question: isn't \(z_j^{(l)}\) the result of the linear transform before applying the activation function? Moreover, shouldn't we use \(x_i^{(l-1)}\) in the sum, as it is the output of the previous layer that is multiplied by the weights?
Correct. Thanks for noticing these typos; I have updated the comment with the correct symbols in accordance with the lecture notes.