
Inconsistency in notations for the optimization problems

Hi,
I find it quite confusing to figure out when we should and when we should not include the factor \(N\) (the size of the training set) in our optimization objective.

For example, in linear regression we divided by \(N\) for the residuals only:
\(\mathcal{L}(w)=\frac{1}{2N} \sum_{i=1}^{N}(y_i-x_i^Tw)^2 + \lambda ||w||^2\)

\(\nabla\mathcal{L}(w)= -\frac{1}{N} \sum_{i=1}^{N}(y_i-x_i^Tw)x_i + 2\lambda w \)

In logistic regression, we did not divide by \(N\) for either the residuals or the regularization term:
\(\mathcal{L}(w)= - \sum_{i=1}^{N} \left[ y_i \log(\sigma(x_i^Tw)) + (1-y_i) \log(1-\sigma(x_i^Tw)) \right] + \lambda ||w||^2 \)

\(\nabla\mathcal{L}(w)= \sum_{i=1}^{N} x_i (\sigma(x_i^Tw) - y_i) + 2\lambda w \)

In SVM, we did not divide by N for the residuals, but we divided by N for the tradeoff parameter (similar to regularization in previous classifiers):
\(\mathcal{L}(w)= \sum_{i=1}^{N} [1-y_i x_i^T w]_{+} + \frac{\lambda}{2} ||w||^2 \)

\(\nabla \mathcal{L}(w)= \sum_{i=1}^{N} (-y_i x_i)1_{1-y_i x_i^T w >0} + \lambda w \)
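
To make the question concrete, here is roughly how I currently implement these three gradients in NumPy (just my own sketch, not the lab code; `X`, `y`, `w` and `lam` are placeholder names):

```python
import numpy as np

def grad_linreg(X, y, w, lam):
    # (1/N) * sum_i (x_i^T w - y_i) x_i + 2*lam*w   -- residual term is averaged
    N = X.shape[0]
    return X.T @ (X @ w - y) / N + 2 * lam * w

def grad_logreg(X, y, w, lam):
    # sum_i (sigma(x_i^T w) - y_i) x_i + 2*lam*w    -- no 1/N anywhere
    sigma = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (sigma - y) + 2 * lam * w

def grad_svm(X, y, w, lam):
    # sum_i -y_i x_i * 1{1 - y_i x_i^T w > 0} + lam*w   -- hinge sum, (lam/2)*||w||^2 regularizer
    violated = (1 - y * (X @ w)) > 0
    return -(X[violated].T @ y[violated]) + lam * w
```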

I know that scaling the objective does not change the solution (the gradient direction remains the same; only its norm changes, which affects the speed of convergence), but my question is rather: "which notation should I adopt to be consistent?"

Another question: in Lab 7 (SVM), there was a note in Exercise 1 saying that we should multiply \(\nabla \mathcal{L}(w)\) by \(N\) because it is a sum, not an average as in the previous SGD exercises. I did not understand what is meant by this. Why should we multiply by \(N\)?

Top comment

Hi Hussein,

Thank you for your detailed question/comment and sorry for the inconsistency in the course.

Long story short, there is no single right answer! All these formulations are exactly equivalent, and there is no standard convention in the literature:

  • If you think of your optimization objective as a cost function, then it makes sense to consider the sum of the losses: this is the total cost you will incur when making your predictions.
  • If you think of your optimization objective as your empirical risk, then it makes sense to consider the average of the losses, i.e., the empirical expectation of your loss.

Both conventions are used in the literature, and you will often need to juggle between them when you work on different projects or read different papers. You will be exposed to both objectives, so you should get used to both of them. This explains why I am a bit reluctant to adopt and present only one convention during the course: you might then be lost when you face the other one.

What is important to note is that the choice impacts the effect of the regularization parameter: if you use the same value of \(\lambda\) in the two formulations (with and without \(\frac{1}{N}\) in front of the sum of the individual losses), then you will not get the same regularization effect. This is something you have to be particularly careful about in practice, and it is often the cause of errors in experiments.
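
To see this concretely, here is a small sketch with ridge regression, where the closed-form solution makes the effect visible (this is just an illustration I wrote for this answer, not taken from the labs; I use \(\frac{\lambda}{2}\|w\|^2\) as the regularizer in both formulations):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)
lam = 1.0

# Summed objective:   (1/2) sum_i (y_i - x_i^T w)^2 + (lam/2) ||w||^2
w_sum = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Averaged objective: (1/2N) sum_i (y_i - x_i^T w)^2 + (lam/2) ||w||^2
w_avg = np.linalg.solve(X.T @ X / N + lam * np.eye(d), X.T @ y / N)

print(np.linalg.norm(w_sum - w_avg))       # clearly nonzero: same lam, different solutions

# The averaged objective with lam / N recovers the summed objective's solution.
w_avg_rescaled = np.linalg.solve(X.T @ X / N + (lam / N) * np.eye(d), X.T @ y / N)
print(np.allclose(w_sum, w_avg_rescaled))  # True
```

In other words, a given \(\lambda\) in the averaged formulation corresponds to \(N\lambda\) in the summed one, so the "right" value to use depends on which convention your code implements.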

To reply to your question about Lab 7: you should think of SGD as a general optimization algorithm to minimize a function \(f\). At each step you follow a direction \(- g_t\) which should be, on average, equal to your negative true gradient \(- \nabla f(\theta_t)\):

$$ \mathbb E[ g_t] = \nabla f ( \theta_t). $$

If your objective is \( \sum_{i=1}^n f_i(\theta) \) then you should take \( g_t = n \nabla f_{i_t}(\theta_t)\), where \(i_t\) is uniform over \(\{1,\dots,n\}\). Indeed, you get

$$ \mathbb E[ g_t] = \sum_{i=1}^n \mathbb P \{i_t=i\} n \nabla f_{i}(\theta_t) = \sum_{i=1}^n \frac{1}{n}n \nabla f_{i}(\theta_t) = \sum_{i=1}^n \nabla f_{i}(\theta_t) = \nabla f(\theta_t). $$

If your objective is \(\frac{1}{n} \sum_{i=1}^n f_i(\theta) \) then you should take \( g_t = \nabla f_{i_t}(\theta_t)\), where \(i_t\) is uniform over \(\{1,\dots,n\}\). Indeed, you get

$$ \mathbb E[ g_t] = \sum_{i=1}^n \mathbb P \{i_t=i\} \nabla f_{i}(\theta_t) = \sum_{i=1}^n \frac{1}{n} \nabla f_{i}(\theta_t) = \frac{1}{n} \sum_{i=1}^n \nabla f_{i}(\theta_t) = \nabla f(\theta_t) $$
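
Here is what this looks like in code, as a small sketch on a toy least-squares problem (written only for this answer, not taken from Lab 7): the only difference between the two conventions is the factor \(n\) applied to the stochastic gradient when the objective is a plain sum.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=n)

def grad_i(theta, i):
    # gradient of the i-th term f_i(theta) = (y_i - x_i^T theta)^2 / 2
    return (X[i] @ theta - y[i]) * X[i]

def sgd(gamma, steps, objective):
    theta = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)          # i_t uniform over the n examples (0-based here)
        g = grad_i(theta, i)
        if objective == "sum":       # f(theta) = sum_i f_i(theta)
            g = n * g                # g_t = n * grad f_{i_t}(theta_t), so E[g_t] = grad f(theta_t)
        # if objective == "average": f(theta) = (1/n) sum_i f_i(theta),
        #                            and g_t = grad f_{i_t}(theta_t) is already unbiased
        theta -= gamma * g
    return theta

theta_sum = sgd(gamma=1e-3, steps=20000, objective="sum")
theta_avg = sgd(gamma=1e-1, steps=20000, objective="average")
print(theta_sum, theta_avg)   # both end up close to the least-squares solution
```

Note that in the "sum" convention the stochastic gradients are roughly \(n\) times larger, so the step size typically needs to be roughly \(n\) times smaller to get comparable behavior.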

I hope this answers your question. Otherwise, ask it again during the next break and I'll give you more details.

Cheers,
Nicolas

Thank you so much for providing such a detailed and elegant explanation for both questions.
This is extremely helpful for understanding how to compute the gradient in SGD in any setting. In summary, we need to make sure that our gradient \(g_t\) is computed such that the expectation of \(g_t\) over the training examples is equal to the true gradient.

Thanks again!
Have a nice weekend.

