Finding smoothness parameter of least squares

I'm trying to derive myself the smoothness parameter of the function

$$f(x) = \frac{1}{2n} \sum_{i=1}^{n} (a_i^\top x - b_i)^2 = \frac{1}{2n} \|Ax - b\|^2 $$

I think I've got it but I just don't understand properly how the final equality is obtained...

Use lemma 2.5 (sum of smooth functions)

$$ L_i = \frac{1}{2n} $$

$$ f_i = (a_i^\top x - b_i)^2 $$

$$ \nabla f_i = 2(a_i^\top x - b_i)a_i $$

$$ \lVert \nabla^2 f_i \rVert = 2 \lVert a_ia_i^\top \rVert =: L_i $$

Thus \( \nabla f_i \) is Lipschitz continuous and by lemma 2.4 \( f_i \) is smooth with parameter \( L_i \)

$$ L = \sum_{i=1}^{n} \lambda_iL_i = \frac{1}{n} \sum_{i=1}^{n} \lVert a_ia_i^\top \rVert = \frac{1}{n} \lVert A \rVert ^2 $$

Top comment

Hi,

The notebook indeed pointed out to Lemma 2.5, this is a mistake and it should point to Lemma 2.3 (now changed on github).

Indeed Lemma 2.3 from the lecture notes (which is proven in the week 2 lab) states that the smoothness of a quadratic with hessian \( Q \) is \( 2 \Vert Q \Vert_{spectral \ norm} \) -smooth. In your case, your function \( f \) is a quadratic whose hessian is \( \frac{1}{2 n } A^{\top} A \), hence it is \( \frac{1}{n } \Vert A^{\top} A \Vert \)-smooth.

It remains to notice that \( \Vert A^{\top} A \Vert = \Vert A \Vert^2 \). This is far from trivial, one way to see this is as follows:

$$\begin{aligned} \Vert A^T A \Vert &= \sup_{\Vert x \Vert \leq 1} \Vert A^T A x \Vert \\ &= \sup_{\Vert x \Vert \leq 1} \sup_{\Vert y \Vert \leq 1} \langle y, A^T A x \rangle \\ &= \sup_{\Vert x \Vert \leq 1} \sup_{\Vert y \Vert \leq 1} \langle A y, A x \rangle \\ &\overset{(a)}{=} \sup_{\Vert x \Vert \leq 1} \sup_{\Vert y \Vert \leq 1} \Vert A x \Vert \Vert A y \Vert \\ &= (\sup_{\Vert x \Vert \leq 1} \Vert A x \Vert ) ( \sup_{\Vert y \Vert \leq 1} \Vert A y \Vert ) \\ &= ( \sup_{\Vert x \Vert \leq 1} \Vert A x \Vert )^2 \\ &= \Vert A \Vert^2 \end{aligned}$$

Where the second equality comes from the fact that \( \Vert u \Vert = \sup_{\Vert y \Vert \leq 1} \langle y, u \rangle \) and \( (a) \) from Cauchy-Schwartz and the fact that it is attained for \( x = y \).

Using Lemma 2.5 as you did is possible but does not lead to a tight result. Indeed doing so you get a smoothness constant \( \frac{1}{n} \sum_i \Vert a_i a_i^T \Vert \) which is bigger (but not necessarily an equality) than \( \frac{1}{n} \Vert A \Vert \) (the fact that it is necessarily bigger comes from the triangular inequality).

Let me know if this is clear. Best.

Scott

This comment was deleted..
This comment was deleted..
This comment was deleted..
Top comment

Hi,

The notebook indeed pointed out to Lemma 2.5, this is a mistake and it should point to Lemma 2.3 (now changed on github).

Indeed Lemma 2.3 from the lecture notes (which is proven in the week 2 lab) states that the smoothness of a quadratic with hessian \( Q \) is \( 2 \Vert Q \Vert_{spectral \ norm} \) -smooth. In your case, your function \( f \) is a quadratic whose hessian is \( \frac{1}{2 n } A^{\top} A \), hence it is \( \frac{1}{n } \Vert A^{\top} A \Vert \)-smooth.

It remains to notice that \( \Vert A^{\top} A \Vert = \Vert A \Vert^2 \). This is far from trivial, one way to see this is as follows:

$$\begin{aligned} \Vert A^T A \Vert &= \sup_{\Vert x \Vert \leq 1} \Vert A^T A x \Vert \\ &= \sup_{\Vert x \Vert \leq 1} \sup_{\Vert y \Vert \leq 1} \langle y, A^T A x \rangle \\ &= \sup_{\Vert x \Vert \leq 1} \sup_{\Vert y \Vert \leq 1} \langle A y, A x \rangle \\ &\overset{(a)}{=} \sup_{\Vert x \Vert \leq 1} \sup_{\Vert y \Vert \leq 1} \Vert A x \Vert \Vert A y \Vert \\ &= (\sup_{\Vert x \Vert \leq 1} \Vert A x \Vert ) ( \sup_{\Vert y \Vert \leq 1} \Vert A y \Vert ) \\ &= ( \sup_{\Vert x \Vert \leq 1} \Vert A x \Vert )^2 \\ &= \Vert A \Vert^2 \end{aligned}$$

Where the second equality comes from the fact that \( \Vert u \Vert = \sup_{\Vert y \Vert \leq 1} \langle y, u \rangle \) and \( (a) \) from Cauchy-Schwartz and the fact that it is attained for \( x = y \).

Using Lemma 2.5 as you did is possible but does not lead to a tight result. Indeed doing so you get a smoothness constant \( \frac{1}{n} \sum_i \Vert a_i a_i^T \Vert \) which is bigger (but not necessarily an equality) than \( \frac{1}{n} \Vert A \Vert \) (the fact that it is necessarily bigger comes from the triangular inequality).

Let me know if this is clear. Best.

Scott

Small question regarding Scott's answer. Isn't the hessian actually

$$ \frac{1}{n} A^T A $$

?

Add comment

Post as Anonymous Dont send out notification