
Primal formula

Hi,
I don't understand why in the videos we've got this formula
Screenshot from 2020-11-03 11-00-40.jpg

And in the course notes we've got this

Screenshot from 2020-11-03 11-01-03.jpg

This leads to "wrong" results in the code but also to quite some confusion...

I'm also a little confused as to why in the exercise we multiply the gradient by "num_features".

Screenshot from 2020-11-03 11-05-24.jpg

Thanks a lot for clarifying!

Whoa, sorry, quite a bug with my computer.

Hello,
I invite you to join the exercise correction session on Zoom this afternoon. All of this should be made clear there.
Best,
Scott

Hey,

As he said, Scott will explain all the details in today's solution session. But let me make a quick remark, since this is far more general than SVMs.
When you consider a regularized objective, your objective function is the sum of a function \(f\) and a regularization term \(R\) (which is also a function of your parameters, here \(R(w) = \|w\|_2^2\)), and you balance the two terms with a regularization parameter \( \lambda \).

If \( \lambda = 0 \), you are only minimizing the function \(f\) (the regularization disappears), and if \( \lambda = \infty \), you are only minimizing the regularization term (the function \(f\) disappears from the objective). In practice you pick a \( \lambda \) that gives a good tradeoff between the two terms (so that you minimize a bit of both).

So why is this relevant to your question? Because the two problems you have written down are essentially the same: they just correspond to two different regularization parameters.

$$ \arg\min \frac{1}{N} \sum_{i=1}^N [ 1-y_i x_i^\top w]_+ + \frac{\lambda}{2} \|w\|_2^2 = \arg\min \sum_{i=1}^N [ 1-y_i x_i^\top w]_+ + \frac{\tilde \lambda}{2} \|w\|_2^2,$$

where \(\tilde \lambda = N\lambda \).
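If it helps, here is a minimal numerical sketch (in Python; the toy data and the function names are made up for illustration, they are not the ones from the exercise) checking that the two formulations give the same minimizer once \( \tilde \lambda = N \lambda \):

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (made up for illustration): x[i] is a feature vector, y[i] in {-1, +1}.
rng = np.random.default_rng(0)
N, D = 50, 3
x = rng.normal(size=(N, D))
y = np.where(x @ rng.normal(size=D) > 0, 1.0, -1.0)

def averaged_objective(w, lam):
    # (1/N) sum_i [1 - y_i x_i^T w]_+ + lam/2 * ||w||^2   (video version)
    hinge = np.maximum(0.0, 1.0 - y * (x @ w))
    return hinge.mean() + 0.5 * lam * w @ w

def summed_objective(w, lam_tilde):
    # sum_i [1 - y_i x_i^T w]_+ + lam_tilde/2 * ||w||^2   (course-notes version)
    hinge = np.maximum(0.0, 1.0 - y * (x @ w))
    return hinge.sum() + 0.5 * lam_tilde * w @ w

lam = 0.1
w0 = np.zeros(D)
w_avg = minimize(averaged_objective, w0, args=(lam,)).x
w_sum = minimize(summed_objective, w0, args=(N * lam,)).x  # lam_tilde = N * lam

print(np.allclose(w_avg, w_sum, atol=1e-3))  # same minimizer, up to solver tolerance
```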

Best,
Nicolas

Wow, good call!

Thanks for the clear explanation and the quick answer

Hello, given that the two problems are the same up to the regularization parameter: for ex07, could we have minimized the left-hand side instead? In that case there would have been no need to multiply the stochastic gradient by N; we would just have had to modify the lambda term, correct?

Hi,
This is correct.
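For concreteness, here is a small sketch (Python; the variable names are hypothetical, not the ones from ex07) of the two equivalent stochastic (sub)gradient updates: either scale the sampled subgradient by N and regularize with \( \tilde \lambda \), or leave it unscaled and regularize with \( \lambda = \tilde \lambda / N \):

```python
import numpy as np

def sampled_subgradient(w, x_n, y_n):
    # Subgradient of the single hinge term [1 - y_n x_n^T w]_+ at w.
    return -y_n * x_n if 1.0 - y_n * (x_n @ w) > 0 else np.zeros_like(w)

# Variant 1: minimize  sum_i [1 - y_i x_i^T w]_+ + lam_tilde/2 * ||w||^2.
# The sampled subgradient must be scaled by N to be an unbiased estimate
# of the subgradient of the full sum.
def sgd_step_sum(w, x_n, y_n, N, lam_tilde, gamma):
    g = N * sampled_subgradient(w, x_n, y_n) + lam_tilde * w
    return w - gamma * g

# Variant 2: minimize  (1/N) sum_i [1 - y_i x_i^T w]_+ + lam/2 * ||w||^2
# with lam = lam_tilde / N. The sampled subgradient is already an unbiased
# estimate of the subgradient of the average, so no factor N is needed.
def sgd_step_avg(w, x_n, y_n, lam, gamma):
    g = sampled_subgradient(w, x_n, y_n) + lam * w
    return w - gamma * g
```

With \( \lambda = \tilde \lambda / N \), the gradient in the second variant is exactly \(1/N\) times the gradient in the first, so the two updates follow the same path once the step size \( \gamma \) is rescaled by \(N\).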
Best, Scott

