
SVM Dual problem optimisation

When trying to find

$$\max_{\alpha\in\mathbb{R}^N} \; \alpha^\top \mathbf{1} - \frac{1}{2\lambda}\,\alpha^\top Y X X^\top Y \alpha$$

Why do we not differentiate with respect to \(\alpha\) and set the gradient to zero, instead of doing coordinate descent? That would give something like

$$\mathbf{1} - \frac{1}{\lambda} Y X X^\top Y \alpha = 0$$

$$\alpha = \lambda \left(Y X X^\top Y\right)^{-1} \mathbf{1}$$

And then maybe restrict the \(\alpha_n\) to \([0,1]\) or something.
Also, I have trouble understanding what it means to optimize the \(\alpha_n\) one by one. Does the objective become \(\alpha_n - \frac{\alpha_n y_n x_n^\top x_n y_n \alpha_n}{2\lambda}\)? Then the solution would be \(\frac{\lambda}{x_n^\top x_n}\), if I'm not mistaken? It seems like it is always very small, and it does not depend on \(y_n\) since \(y_n^2 = 1\), which is quite weird.

Another question: Lab 7 was very insistent on the fact that the primal objective is now a sum instead of a mean. What difference does it make, since it is simply a division by a constant?

Hello Hugo,

Thank you for your question. Indeed, if we were looking at the unconstrained problem you wrote down, we could have solved it directly in closed form by setting the gradient to zero.

However, we are actually interested in the same problem with the additional constraint that \(\alpha\in[0,1]^N\). If you look at how the dual problem is derived, you will see that this constraint is essential: if \(\alpha\) is unbounded, the inner maximization always yields \(0\) or \(\infty\) rather than the hinge loss.
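To make this concrete, recall the variational form of the hinge loss used in the derivation (notation as in the lectures, with \(w\) the primal weights):

$$\max(0,\, 1 - y_n w^\top x_n) = \max_{\alpha_n \in [0,1]} \alpha_n \left(1 - y_n w^\top x_n\right)$$

If \(t > 0\), the maximum is attained at \(\alpha_n = 1\) and equals \(t\); if \(t \le 0\), it is attained at \(\alpha_n = 0\) and equals \(0\). But if \(\alpha_n\) ranges over all of \(\mathbb{R}\) (or \([0,\infty)\)), the supremum is \(+\infty\) whenever \(1 - y_n w^\top x_n > 0\) and \(0\) otherwise, so you no longer recover the hinge loss.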

Moreover, in general, if

$$ \alpha^* = \arg \min_{\alpha\in\mathbb{R}^N} f(\alpha) $$

and

$$ \alpha^p = \arg \min_{\alpha\in [0,1]^N} f(\alpha) $$

then \(\alpha^p\) is not necessarily the projection of \(\alpha^*\) onto the constrained domain. It is a good exercise to find a counterexample yourself; there are plenty of them.
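Here is one such toy counterexample, sketched in numpy (my own example, not from the notes): take \(f(\alpha) = (\alpha_1 - 2)^2 + (\alpha_1 + \alpha_2 - 2)^2\). The unconstrained minimizer is \((2, 0)\), whose projection onto \([0,1]^2\) is \((1, 0)\), but the box-constrained minimizer is \((1, 1)\), which achieves a strictly lower value:

```python
import numpy as np

def f(a):
    return (a[0] - 2) ** 2 + (a[0] + a[1] - 2) ** 2

# Unconstrained minimizer (gradient set to zero): alpha* = (2, 0)
a_star = np.array([2.0, 0.0])
# Its projection onto the box [0, 1]^2
a_proj = np.clip(a_star, 0.0, 1.0)           # [1. 0.]

# Box-constrained minimizer found by a fine grid search over [0, 1]^2
grid = np.linspace(0.0, 1.0, 101)
vals = np.array([[f((x, y)) for y in grid] for x in grid])
i, j = np.unravel_index(vals.argmin(), vals.shape)
a_box = np.array([grid[i], grid[j]])         # [1. 1.]

print(a_proj, f(a_proj))  # [1. 0.] 2.0
print(a_box, f(a_box))    # [1. 1.] 1.0  -- strictly better than the projection
```

The cross term \((\alpha_1 + \alpha_2 - 2)^2\) couples the coordinates, which is exactly why projecting the unconstrained solution fails.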

Therefore, you don't have a closed-form solution for \(\alpha^p\), and you need an optimization algorithm to find it. For this kind of structured problem, coordinate descent is well suited; you can find a formal definition in the lecture notes. At each iteration, you sample one dimension \(n \in \{1,\dots,N\}\) uniformly at random and then solve the problem exactly for that coordinate, while the other coordinates stay fixed at their current values.
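A minimal numpy sketch of such a coordinate update for the dual objective above, assuming \(G_{ij} = y_i y_j x_i^\top x_j\) (function name and loop structure are my own, not the lab's reference implementation):

```python
import numpy as np

def coordinate_ascent_svm_dual(X, y, lam, n_iters=2000, seed=0):
    """Randomized exact coordinate ascent on
    f(alpha) = alpha^T 1 - (1 / (2 lam)) alpha^T G alpha  over [0, 1]^N,
    with G[i, j] = y_i y_j x_i^T x_j."""
    Z = y[:, None] * X                     # row i is y_i * x_i
    G = Z @ Z.T
    N = X.shape[0]
    alpha = np.zeros(N)
    rng = np.random.default_rng(seed)
    for _ in range(n_iters):
        n = rng.integers(N)
        # Contribution of all coordinates except n to (G alpha)_n
        r = G[n] @ alpha - G[n, n] * alpha[n]
        if G[n, n] > 0:
            # Exact maximizer along coordinate n, clipped to the box [0, 1]
            alpha[n] = np.clip((lam - r) / G[n, n], 0.0, 1.0)
    return alpha
```

Each update sets the one-dimensional derivative \(1 - (G\alpha)_n/\lambda\) to zero and projects back onto \([0,1]\); since the objective is concave in \(\alpha_n\), clipping gives the exact constrained maximizer, so the objective never decreases.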

Finally, there is a mistake in the way you derived your single-coordinate equation: you cannot drop the cross terms \( \alpha_n y_n x_n^\top x_j y_j \alpha_j \), which also matter in the optimization problem. You can ask the TAs tomorrow for the formal derivation (don't forget to solve it with the constraints).

Best,
Nicolas

Hi Hugo,
Concerning your last question:

"Another question: Lab 7 was very insistent on the fact that the primal objective is now a sum instead of a mean. What difference does it make, since it is simply a division by a constant?"

As you mention, it doesn't make any "theoretical" difference, since it is a division by a constant (the argmin is unchanged). However, you have to be careful when using an optimization algorithm such as SGD: the stochastic gradient of a sum is not the same as the stochastic gradient of a mean.
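A quick numeric sketch of that point (toy numbers of my own, not from the lab): a single-sample gradient is an unbiased estimate of the gradient of the *mean* objective, but it underestimates the gradient of the *sum* objective by a factor of \(N\), so the same learning rate behaves very differently:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.normal(size=N)

w = 0.5
# Per-sample loss l_n(w) = (w - x_n)^2, with gradient 2 * (w - x_n)
grads = 2 * (w - x)

sum_grad = grads.sum()    # full gradient of the SUM objective
mean_grad = grads.mean()  # full gradient of the MEAN objective

# One sampled grads[n] is unbiased for mean_grad; to estimate sum_grad
# you would need to scale it by N, i.e. sum_grad == N * mean_grad.
print(sum_grad, N * mean_grad)
```

In practice this means that a step size tuned for the averaged objective must be divided by \(N\) when you switch to the summed objective (or vice versa).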
Best,
Scott


