Constrained optimization & feature normalization

Thank you so much for answering all my previous questions :)

I have the following questions from the second part of the lecture:

  1. What happens when the minimum of a convex function lies in the constrained area? Is it possible to somehow detect that the algorithm is stuck? How do constraints affect convergence?

  2. Is regularization (L1, L2) an example of constrained optimization?

  3. Regarding the advice to normalize input features, should we always normalize them (values between 0 and 1), or is it sometimes better to standardize the features (mean 0, std 1)?

Hi,

"What happens when the minimum of a convex function lies in the constrained area?"

Then it's an easy case: the constrained optimization problem is equivalent to the unconstrained one. You can still run any constrained optimization method (e.g., projected gradient descent), and it will converge to the optimum given the right choice of step size.
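
To make this concrete, here is a minimal numpy sketch of projected gradient descent; the quadratic objective, the box constraint, and the names (`projected_gradient_descent`, `project`) are just illustrative choices, not anything from the lecture.

```python
import numpy as np

def projected_gradient_descent(grad, project, w0, step_size, n_iters=1000):
    """Minimize a convex function: take a gradient step, then project
    the iterate back onto the feasible set."""
    w = w0
    for _ in range(n_iters):
        w = project(w - step_size * grad(w))
    return w

# Example: minimize f(w) = ||w - c||^2 over the box [0, 1]^3.
c = np.array([0.5, 2.0, -1.0])
grad = lambda w: 2 * (w - c)                 # gradient of f
project = lambda w: np.clip(w, 0.0, 1.0)     # Euclidean projection onto the box
w_star = projected_gradient_descent(grad, project, np.zeros(3), step_size=0.1)
print(w_star)  # ~ [0.5, 1.0, 0.0]
# The first coordinate's unconstrained minimum (0.5) already lies inside the
# box, so the projection is a no-op there -- the "easy case" described above.
```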

"Is it possible to somehow detect that the algorithm is stuck?"

For convex optimization problems, many widely used optimization algorithms enjoy convergence guarantees (again, if the step size and similar parameters are set correctly) and cannot get stuck. For non-convex optimization (e.g., training neural networks), it is indeed possible to "get stuck" in the sense of converging to a local minimum or a saddle point. In general, I think it is not possible to detect that you are, say, at a local minimum when you have no information about how good the global minimum should be. However, if you know that the global minimum is, say, very close to 0 (typical for training neural networks), but you converge (i.e., the gradient norm becomes very small) to a much higher value of the objective, this can indicate that the algorithm "got stuck".
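
A sketch of that heuristic (the thresholds `grad_tol` and `loss_gap` are arbitrary, made-up values):

```python
def looks_stuck(grad_norm, loss, target_loss=0.0, grad_tol=1e-5, loss_gap=1e-1):
    """Heuristic from above: a tiny gradient norm says we have converged
    to a stationary point, yet the objective is still far from the value
    we believe the global minimum should have (e.g., ~0 for neural nets)."""
    converged = grad_norm < grad_tol
    far_from_target = loss > target_loss + loss_gap
    return converged and far_from_target
```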

"How do constrains affect convergence?"

It depends on the particular method. For example, projected gradient descent has the same convergence rate as plain gradient descent (e.g., see these slides: https://ttic.uchicago.edu/~nati/Teaching/TTIC31070/2015/Lecture16.pdf).
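
For reference, for a convex and L-smooth objective with a suitable constant step size, both methods achieve the same sublinear rate (see the slides for the exact constants):

```latex
f(x_k) - f(x^\star) \le \mathcal{O}\!\left(\tfrac{1}{k}\right)
```

i.e., the projection step does not change the O(1/k) rate.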

"Is regularization (l1,l2) an example of constrained optimization?"

No, since in this case we have an L1 or L2 penalty on the weights directly in the objective, and there are no constraints. In other words, the vector w can still take any value in R^d without being constrained to any subset of R^d.
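
For instance, here is a minimal sketch of L2-regularized (ridge) least squares: the penalty term lives inside the objective, and the gradient step below updates w freely over all of R^d, with no projection anywhere (the function names and data are made up for illustration).

```python
import numpy as np

# L2-regularized least squares: the penalty lam * ||w||^2 is part of the
# objective itself; w is optimized over all of R^d, with no constraint set.
def ridge_loss(w, X, y, lam):
    return np.sum((X @ w - y) ** 2) + lam * np.sum(w ** 2)

def ridge_grad(w, X, y, lam):
    return 2 * X.T @ (X @ w - y) + 2 * lam * w

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
w = np.zeros(5)
for _ in range(500):
    w -= 1e-3 * ridge_grad(w, X, y, lam=1.0)  # plain gradient step, no projection
```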

"Regarding the advice to normalize input features, should we always normalize them (values between 0,1) or it is sometimes better to standardize the features (mean 0, std 1)?"

As mentioned in the lecture, one goal of normalization / standardization is to reduce the condition number (https://en.wikipedia.org/wiki/Condition_number) of the problem in order to speed up gradient descent. Taking this into account, one should select the preprocessing technique that better reduces the condition number. For example, Section 5.3 of http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf suggests that subtracting the mean (which is not done if one just keeps the values in [0, 1]) helps reduce the condition number. But I'm not sure that standardization will always lead to a smaller condition number; I would expect that in practice, both methods perform roughly the same.
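
If you want to check this on your own data, here is a small numpy experiment (the feature scales are made up, and using X^T X as the Hessian of a least-squares objective is my own illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic features on very different scales, with nonzero means.
X = rng.normal(size=(1000, 3)) * np.array([1.0, 100.0, 0.01]) \
    + np.array([5.0, -50.0, 0.2])

def cond(X):
    # Condition number of X^T X, the Hessian of a least-squares objective.
    return np.linalg.cond(X.T @ X)

X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # values in [0, 1]
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)                 # mean 0, std 1

print(cond(X), cond(X_minmax), cond(X_standard))
# Both schemes shrink the condition number by orders of magnitude here; on
# this data, standardization does better because it also removes the mean.
```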

I hope that helps,
Maksym
