Lecture 11b differentiation

Hi,

Could you please explain how you compute the derivative of the surrogate on this slide?

[attached image: differientation.jpg]

I can do it when µ_k is 1-dimensional, but I can't do it in the multi-dimensional case. Also, could you please explain what a Lagrangian term is and how to differentiate it?

Top comment

Ignoring constants, the function that you want to maximize is \(\sum_n\sum_k q_{kn}\left[\tfrac{1}{2}\log \det (\Sigma_k^{-1}) - \tfrac{1}{2} (x_n-\mu_k)^\top \Sigma_k^{-1} (x_n-\mu_k)\right]\) (for now I also ignore the terms that depend on \(\pi\)).

The gradient with respect to \(\mu_k\) is \(\sum_n q_{kn}\Sigma_k^{-1} (x_n-\mu_k)\); if you set it to zero you find the expression given in the lecture.
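
If it helps to convince yourself, here is a quick numpy sketch (the data, responsibilities, and dimensions are all made up for the check) that compares this gradient with finite differences and verifies that the lecture's update for \(\mu_k\) makes it vanish:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.normal(size=(N, d))                 # data points x_n
q = rng.random(N)                           # responsibilities q_{kn} for one fixed k
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)             # a valid covariance Sigma_k
Sigma_inv = np.linalg.inv(Sigma)
mu = rng.normal(size=d)                     # current mu_k

def surrogate(mu):
    # sum_n q_{kn} [ 1/2 log det Sigma_k^{-1} - 1/2 (x_n-mu)^T Sigma_k^{-1} (x_n-mu) ]
    diff = X - mu
    quad = np.einsum('ni,ij,nj->n', diff, Sigma_inv, diff)
    return np.sum(q * (0.5 * np.log(np.linalg.det(Sigma_inv)) - 0.5 * quad))

# analytic gradient: sum_n q_{kn} Sigma_k^{-1} (x_n - mu_k)
grad_analytic = Sigma_inv @ (q[:, None] * (X - mu)).sum(axis=0)

# central finite differences, coordinate by coordinate
eps = 1e-6
grad_fd = np.array([(surrogate(mu + eps * e) - surrogate(mu - eps * e)) / (2 * eps)
                    for e in np.eye(d)])
print(np.allclose(grad_analytic, grad_fd, atol=1e-5))    # expect True

# the stationary point from the lecture: mu_k = sum_n q_{kn} x_n / sum_n q_{kn}
mu_star = (q[:, None] * X).sum(axis=0) / q.sum()
print(np.allclose(Sigma_inv @ (q[:, None] * (X - mu_star)).sum(axis=0), 0, atol=1e-10))  # expect True
```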

The gradient with respect to \(\Sigma_k^{-1}\) is \(\tfrac{1}{2}\sum_n q_{kn} \left[\Sigma_k - (x_n-\mu_k)(x_n-\mu_k)^\top\right]\); if you set it to zero you again find the expression in the lecture.
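
Same idea for \(\Sigma_k^{-1}\): a small numerical sanity check (again with made-up data), treating \(\Lambda = \Sigma_k^{-1}\) as a free matrix variable:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 3
X = rng.normal(size=(N, d))
q = rng.random(N)
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
Lam = np.linalg.inv(A @ A.T + d * np.eye(d))      # Lambda = Sigma_k^{-1}

def surrogate(Lam):
    diff = X - mu
    quad = np.einsum('ni,ij,nj->n', diff, Lam, diff)
    return np.sum(q * (0.5 * np.log(np.linalg.det(Lam)) - 0.5 * quad))

# analytic gradient w.r.t. Lambda: 1/2 sum_n q_{kn} [ Sigma_k - (x_n-mu_k)(x_n-mu_k)^T ]
Sigma = np.linalg.inv(Lam)
S = np.einsum('n,ni,nj->ij', q, X - mu, X - mu)   # sum_n q_{kn} (x_n-mu)(x_n-mu)^T
grad_analytic = 0.5 * (q.sum() * Sigma - S)

# entrywise central finite differences
eps = 1e-6
grad_fd = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d)); E[i, j] = 1.0
        grad_fd[i, j] = (surrogate(Lam + eps * E) - surrogate(Lam - eps * E)) / (2 * eps)

print(np.allclose(grad_analytic, grad_fd, atol=1e-4))    # expect True
```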

Now for the optimization with respect to \(\pi\): because the \(\pi_k\) must form a probability distribution, we need to penalize candidate solutions that do not satisfy \(\sum_k \pi_k = 1\). For this we add the Lagrangian term \(\lambda \left( \sum_k \pi_k - 1\right)\) to the objective and optimize over \(\lambda\) too. You see that if we differentiate with respect to \(\lambda\) we recover the condition \(\sum_k \pi_k = 1\), so we have incorporated the constraint into the objective function. Now differentiate with respect to \(\pi_k\): you get \(\sum_n q_{kn}\frac{1}{\pi_k} + \lambda = 0\), i.e. \(\pi_k = -\frac{1}{\lambda}\sum_n q_{kn}\). Plug this into the constraint to find \(\lambda = -\sum_k\sum_n q_{kn} = -N\) (since the responsibilities sum to one over \(k\)), and you get the expression given in the lecture again, \(\pi_k = \frac{1}{N}\sum_n q_{kn}\).
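
A tiny check of these stationarity conditions (with made-up responsibilities, assuming \(q_{kn}\) sums to one over \(k\) for each \(n\)):

```python
import numpy as np

rng = np.random.default_rng(2)
K, Npts = 4, 50
q = rng.random((K, Npts))
q /= q.sum(axis=0)             # responsibilities: sum to 1 over k for each n

N = q.sum()                    # equals Npts here
pi = q.sum(axis=1) / N         # the closed form from the lecture
lam = -N                       # the Lagrange multiplier found above

print(np.allclose(q.sum(axis=1) / pi + lam, 0))   # stationarity in each pi_k
print(np.isclose(pi.sum(), 1.0))                  # the constraint is recovered
```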

Thank you so much for your answer.

How do you compute the gradient with respect to \(\Sigma_k^{-1}\)? I managed to do the one with respect to µ_k by expanding the loss function, but I treated \(1/(2\pi\det(\Sigma))^{1/2}\) as a constant.

The difficult part is the \(\log\det\); the fastest way to differentiate it is to look it up. What works for me in general with these derivatives is the basic definition: as \(h \to 0\), \(f(x+h) \approx f(x) + \nabla f(x)^\top h\).
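
For example, a tiny toy illustration of that definition (nothing to do with the lecture's function):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=3)
h = 1e-4 * rng.normal(size=3)     # a small perturbation

f = lambda v: np.dot(v, v)        # toy function f(x) = ||x||^2
grad = 2 * x                      # its gradient, 2x

# the two numbers agree up to ||h||^2, which is about 1e-8 here
print(f(x + h), f(x) + grad @ h)
```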

Thank you for your answer. I will try to apply it. If you have a detailed solution for finding the derivative I would be more than happy.

If you have a function \(f(A) = x^\top A x\), then \(f(A+H) = f(A) + x^\top H x\). Now you have to write the term \(x^\top H x\) as an inner product to read off the gradient; for this we apply a simple but powerful fact, \(Tr(AB)=Tr(BA)\). This gives \(x^\top H x=Tr(x^\top H x)=Tr( H xx^\top)\), i.e. the gradient is \((xx^\top)^\top=xx^\top\).
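
You can confirm this numerically with finite differences on a random matrix (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
x = rng.normal(size=d)
A = rng.normal(size=(d, d))

f = lambda M: x @ M @ x           # f(A) = x^T A x

# entrywise central finite differences
eps = 1e-6
grad_fd = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d)); E[i, j] = 1.0
        grad_fd[i, j] = (f(A + eps * E) - f(A - eps * E)) / (2 * eps)

print(np.allclose(grad_fd, np.outer(x, x)))   # expect True: the gradient is x x^T
```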

For the \(\log\det\), imagine you have a function \(f(A) = \log \det (A)\) with \(A\) invertible.
We write \(\det(A+H) = \det(A) \det(I+A^{-1}H)\).
Now we use something that you see one time and forget, \(\det(A) = \sum_{\sigma\in S_n}\epsilon(\sigma)\prod_i a_{i,\sigma(i)}\); we use this ugly formula to write \(\det(I+A^{-1}H) = 1 + Tr(A^{-1}H) + o(H)\). Putting everything together you get \(\det(A+H) = \det(A) + \det(A)Tr(A^{-1}H) + o(H)\), which means (I hope you see it) \(\nabla \det(A) = \det(A)A^{-\top}\); now if you combine with the logarithm you get that the gradient of \(f\) at \(A\) is \(A^{-\top}\).
I hope things are easier to see now.
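
If you want to double-check the \(\log\det\) result without redoing the algebra, a finite-difference comparison on a random non-symmetric matrix works (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
A = np.eye(d) + 0.1 * rng.normal(size=(d, d))   # non-symmetric, det stays positive

f = lambda M: np.log(np.linalg.det(M))

# entrywise central finite differences
eps = 1e-6
grad_fd = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d)); E[i, j] = 1.0
        grad_fd[i, j] = (f(A + eps * E) - f(A - eps * E)) / (2 * eps)

print(np.allclose(grad_fd, np.linalg.inv(A).T, atol=1e-6))   # expect True: the gradient is A^{-T}
```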

Thank you, it is much clearer now. However, please don't put this in the exam :)
