## Q19 2020

Hi,

I am confused that the answer to this question is True. Standard GD is a first-order optimization method, and since our function is only smooth and not necessarily convex, we don't have a lower bound for the worst case, right? Also, if the function were convex and smooth, the rate of convergence would be \( O(\frac{1}{\varepsilon}) \); isn't this worse than \( O(\frac{1}{\sqrt{\varepsilon}}) \)? What am I doing wrong?

Thanks

Hi,

The worst case for first-order methods on smooth and convex functions is \( \Omega ( \frac{1}{\sqrt{\varepsilon}} ) \) (see Section 2.6, "Acceleration for smooth convex", in the lecture notes). A posteriori, the worst case over all smooth functions is also \( \Omega ( \frac{1}{\sqrt{\varepsilon}} ) \), since that is a bigger class which includes the convex functions.

As you say, for convex and smooth functions GD has an iteration complexity of \( O (\frac{1}{\varepsilon}) \), which is not optimal since \( \frac{1}{\varepsilon} > \frac{1}{\sqrt{\varepsilon}} \). You saw in class that accelerated GD matches the lower bound, and hence is optimal.
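To make the gap concrete, here is a minimal numerical sketch (not from the course materials; the function names `gd_iters`/`agd_iters` and all parameter choices are illustrative). It runs plain GD and Nesterov's accelerated GD on the classical "hard" smooth convex quadratic and counts the iterations each needs to reach accuracy \( \varepsilon \):

```python
import numpy as np

# Nesterov's hard smooth convex quadratic: f(x) = 0.5 x^T A x - x[0],
# with A the tridiagonal matrix tridiag(-1, 2, -1). Its largest
# eigenvalue is below 4, so f is L-smooth with L = 4.
n, L = 100, 4.0
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
e1 = np.zeros(n)
e1[0] = 1.0

x_star = np.linalg.solve(A, e1)                  # minimizer of f
f_star = 0.5 * x_star @ A @ x_star - x_star[0]   # optimal value

def f(x):
    return 0.5 * x @ A @ x - x[0]

def grad(x):
    return A @ x - e1

def gd_iters(eps, max_iter=200_000):
    """Plain gradient descent with step 1/L; iterations until f - f* <= eps."""
    x = np.zeros(n)
    for t in range(1, max_iter + 1):
        x = x - grad(x) / L
        if f(x) - f_star <= eps:
            return t
    return max_iter

def agd_iters(eps, max_iter=200_000):
    """Nesterov's accelerated gradient descent for smooth convex functions."""
    x = np.zeros(n)
    y = x.copy()
    tk = 1.0
    for t in range(1, max_iter + 1):
        x_new = y - grad(y) / L                            # gradient step at y
        tk_new = (1.0 + np.sqrt(1.0 + 4.0 * tk**2)) / 2.0  # momentum schedule
        y = x_new + ((tk - 1.0) / tk_new) * (x_new - x)    # extrapolation
        x, tk = x_new, tk_new
        if f(x) - f_star <= eps:
            return t
    return max_iter

eps = 1e-3
print("GD :", gd_iters(eps), "iterations")
print("AGD:", agd_iters(eps), "iterations")
```

On this instance plain GD needs far more iterations than accelerated GD, consistent with the \( O(\frac{1}{\varepsilon}) \) versus \( O(\frac{1}{\sqrt{\varepsilon}}) \) picture above.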

Best.

Scott


Hello,

I agree with the original post, in that the correct answer should be False.

I understand we consider \(\mathcal{O}(1/\sqrt{\epsilon})\) to be the worst case overall, as the smart choice is to use accelerated gradient descent.

However, the question clearly asks for the worst case among *every* first-order optimization method, which leads me to read it as "the worst of the worst-case convergence rates among all the first-order methods we saw in the course". A quick look at Table 2.1 of the lecture notes shows that this is \(\mathcal{O}(1/\epsilon)\), which means that the statement is False.

I know the exam has already been printed, but I hope we won't see ambiguous questions like this one.


Hello,

I am confused too, and would be OK with Alejandro's remark.

I also don't understand the implication in this sentence:

"The worst case for first order methods for smooth and convex functions is \( \Omega ( \frac{1}{\sqrt{\varepsilon}} ) \) (see section "2.6 Acceleration for smooth convex" in the lecture notes), a posteriori the worst case for any smooth functions is also \( \Omega ( \frac{1}{\sqrt{\varepsilon}} ) \) since it is a bigger class which includes the convex functions"

I don't see why we can deduce something about non-convex (and still smooth) functions?

Hi,

Although we take care to avoid it, it may happen that an exam contains an ambiguous question, and we apologise when that occurs. However, as is the case here, you should probably start by asking yourself whether you have actually understood the question correctly before claiming that it is ambiguous.

My original answer was maybe not clear enough. The source of the confusion seems to be the notations \( O( \cdot ) \) and \( \Omega(\cdot ) \), which do not have the same meaning. When we speak of lower bounds it does not make sense to use the notation \( O( \cdot ) \). You saw in the lectures that for *any* first-order method, there will always exist a smooth and convex function such that you need *at least* \( \Omega( \frac{1}{ \sqrt{\varepsilon} } ) \) iterations in order to reach accuracy \( \varepsilon \). A posteriori, this immediately says that there will always exist a smooth function (since this class is bigger than the smooth *and* convex class) such that you need *at least* \( \Omega( \frac{1}{ \sqrt{\varepsilon} } ) \) iterations in order to reach accuracy \( \varepsilon \).

The upper bounds in Table 2.1 say something very different: they are convergence upper bounds for specific algorithms on certain classes of functions, which is not the same as a worst-case lower bound over all methods. For instance, the table says that for any smooth and convex function, GD needs *at most* (and not at least!) \( O( \frac{1}{\varepsilon} ) \) iterations to reach accuracy \( \varepsilon \). It says nothing about the worst-case complexity of gradient descent, or of any first-order algorithm.

I hope this makes it clear. Good luck on the exam; I hope there will not be any ambiguous questions, but please start by asking yourself whether you actually understood the question first.
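One way to see the difference is to write out the quantifiers explicitly. A sketch, where \( T_{\mathcal{A}}(f, \varepsilon) \) denotes the number of iterations method \( \mathcal{A} \) needs on \( f \) to reach accuracy \( \varepsilon \) (this notation is mine, not from the notes):

```latex
% Upper bound (Table 2.1): a guarantee for ONE method, here GD,
% uniformly over ALL functions in the class:
\forall f \ \text{smooth convex}: \quad
    T_{\mathrm{GD}}(f, \varepsilon) \le O\!\left(\tfrac{1}{\varepsilon}\right)

% Lower bound (Section 2.6): for EVERY first-order method there
% EXISTS a hard function in the class:
\forall \mathcal{A} \ \text{first-order}, \
\exists f \ \text{smooth convex}: \quad
    T_{\mathcal{A}}(f, \varepsilon) \ge \Omega\!\left(\tfrac{1}{\sqrt{\varepsilon}}\right)

% Since every smooth convex f is also smooth, the same hard f
% witnesses the lower bound over the larger class of all smooth
% functions.
```

The order of the quantifiers is exactly why the lower bound transfers to the bigger class while the upper bound says nothing about worst cases.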

Scott

