Lecture 11b differentiation
Hi,
Could you please explain how you compute the derivative of the surrogate on this slide?
I can do it when \(\mu_k\) is 1-dimensional, but I can't do it in the multi-dimensional case. Also, could you please explain what a Lagrangian term is and how to differentiate it?
Ignoring constants, the function that you want to maximize is \(\sum_n\sum_k q_{kn}\left[\frac{1}{2}\log \det (\Sigma_k^{-1}) - \frac{1}{2} (x_n-\mu_k)^\top \Sigma_k^{-1} (x_n-\mu_k)\right]\) (for now I also ignore the terms that depend on \(\pi\)).
The gradient with respect to \(\mu_k\) is \(\sum_n q_{kn}\Sigma_k^{-1} (x_n-\mu_k)\); if you set it to zero you find the expression given in the lecture.
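Spelled out: multiplying through by \(\Sigma_k\) and solving gives the weighted mean,
\[
\sum_n q_{kn}(x_n-\mu_k)=0 \quad\Longrightarrow\quad \mu_k=\frac{\sum_n q_{kn}\,x_n}{\sum_n q_{kn}}.
\]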
The gradient with respect to \(\Sigma_k^{-1}\) is \(\frac{1}{2}\sum_n q_{kn}\left[\Sigma_k - (x_n-\mu_k)(x_n-\mu_k)^\top\right]\); if you set it to zero you again find the expression in the lecture.
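Spelling this one out too: dividing by the total weight gives the weighted empirical covariance,
\[
\Sigma_k=\frac{\sum_n q_{kn}(x_n-\mu_k)(x_n-\mu_k)^\top}{\sum_n q_{kn}}.
\]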
Now for the optimization with respect to \(\pi\): because it is a probability distribution, we need to penalize candidate solutions that do not satisfy the constraint \(\sum_k \pi_k = 1\). For this we add the Lagrangian term \(\lambda \left( \sum_k \pi_k - 1\right)\) and optimize over \(\lambda\) too. Notice that if we differentiate with respect to \(\lambda\) we recover the condition \(\sum_k \pi_k = 1\), so we have incorporated the constraint into the objective function. Differentiating with respect to \(\pi_k\) you get \(\frac{1}{\pi_k}\sum_n q_{kn} + \lambda = 0\); solve for \(\pi_k\), plug it into the constraint, and you find the expression given in the lecture again.
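In detail, assuming the responsibilities sum to one over the components, \(\sum_k q_{kn}=1\) for each \(n\) (as in the EM setting):
\[
\pi_k=-\frac{1}{\lambda}\sum_n q_{kn},\qquad
\sum_k \pi_k = 1 \;\Longrightarrow\; \lambda=-\sum_k\sum_n q_{kn}=-N,\qquad
\pi_k=\frac{1}{N}\sum_n q_{kn}.
\]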
Thank you so much for your answer.
How do you compute the gradient with respect to \(\Sigma_k^{-1}\)? I managed to do the one with respect to \(\mu_k\) by expanding the loss function, but I treated \(1/\sqrt{(2\pi)^d \det(\Sigma_k)}\) as a constant.
The difficult part is the \(\log\det\) term; the fastest way to differentiate it is to look it up. What works for me in general with these derivatives is the basic definition: as \(h \to 0\), \(f(x+h) \approx f(x) + \nabla f(x)^\top h\).
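The same definition works for matrix arguments if you use the trace (Frobenius) inner product: as \(H \to 0\),
\[
f(X+H)\approx f(X)+\mathrm{Tr}\!\big(\nabla f(X)^\top H\big),
\]
so you read off the gradient by writing the first-order term as a trace against \(H\).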
Thank you for your answer. I will try to apply it. If you have a detailed solution for finding the derivative I would be more than happy.
If you have a function \(f(A) = x^\top A x\), then \(f(A+H) = f(A) + x^\top H x\). Now you have to write the term \(x^\top H x\) as an inner product to read off the gradient; for this we apply the simple but powerful fact \(\mathrm{Tr}(AB)=\mathrm{Tr}(BA)\). This gives \(x^\top H x=\mathrm{Tr}(x^\top H x)=\mathrm{Tr}(H xx^\top)\), i.e. the gradient is \((xx^\top)^\top=xx^\top\).
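Applied to our objective with \(A=\Sigma_k^{-1}\) and \(x=x_n-\mu_k\), this is exactly the quadratic term:
\[
\nabla_{\Sigma_k^{-1}}\,(x_n-\mu_k)^\top\Sigma_k^{-1}(x_n-\mu_k)=(x_n-\mu_k)(x_n-\mu_k)^\top.
\]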
For the \(\log\det\), imagine you have a function \(f(A) = \log \det (A)\) with \(A\) invertible.
We write \(\det(A+H)= \det (A) \det(I+A^{-1}H)\).
Now we use something that you see one time and forget, the Leibniz formula \(\det(A) = \sum_{\sigma\in S_n}\epsilon(\sigma)\prod_i a_{i,\sigma(i)}\). We use this ugly formula to write \(\det(I+A^{-1}H) = 1 + \mathrm{Tr}(A^{-1}H) + o(H)\). Putting everything together you get \(\det(A+H)= \det(A) + \det(A)\,\mathrm{Tr}(A^{-1}H)+ o(H)\), which means (I hope you see it) \(\nabla \det (A) = \det(A)A^{-\top}\). Now if you combine this with the logarithm via the chain rule, you get that the gradient of \(f\) at \(A\) is \(A^{-\top}\).
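Combining the two pieces, and using that \(\Sigma_k^{-1}\) is symmetric (so \(A^{-\top}=A^{-1}\)), you recover the gradient quoted earlier in the thread:
\[
\nabla_{\Sigma_k^{-1}}\Big[\tfrac{1}{2}\log\det(\Sigma_k^{-1})-\tfrac{1}{2}(x_n-\mu_k)^\top\Sigma_k^{-1}(x_n-\mu_k)\Big]
=\tfrac{1}{2}\Sigma_k-\tfrac{1}{2}(x_n-\mu_k)(x_n-\mu_k)^\top.
\]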
I hope things are easier to see now.
Thank you, it is much clearer now. However, please don't put this in the exam :)