Some questions regarding gradient descent. Do I understand correctly that:
1) stochastic gradient descent works best with a large number of iterations (by the law of large numbers)?
2) all the minima we compute using gradient descent are local, i.e. close to w(0)?
3) Also: does gradient descent with N data points converge k times faster than stochastic gradient descent (where k could perhaps be determined analytically, e.g. from the standard deviation of the gradient computed on a single data point)?
Thank you in advance for your answers.
1) Yes, in general you would need more iterations with SGD than with GD (I don't know the link with the law of large numbers, though). However, the cost of a single iteration is much cheaper with SGD.
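As a rough illustration (a minimal sketch; the least-squares setup and all variable names are mine, not from the course): one GD step touches all N points, one SGD step touches a single point, yet the single-point gradient is an unbiased estimate of the full one.

```python
import numpy as np

# Toy least-squares problem: loss L(w) = (1/2N) * ||X w - y||^2.
rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

def full_gradient(w):
    # One GD step uses all N data points: cost O(N * d).
    return X.T @ (X @ w - y) / N

def stochastic_gradient(w, i):
    # One SGD step uses a single data point i: cost O(d),
    # i.e. roughly N times cheaper than the full gradient.
    return X[i] * (X[i] @ w - y[i])

w = np.zeros(d)
g_full = full_gradient(w)
g_stoch = stochastic_gradient(w, rng.integers(N))
# Averaged over all data points, the stochastic gradient
# recovers the full gradient exactly (unbiasedness).
```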
2) Yes, most of the time we converge to a minimum somewhere close to w(0).
If you are minimizing a convex function, you will converge to a global minimizer.
If you are minimizing a non-convex and smooth function, you will converge to a stationary point (a point where the gradient is zero). Characterizing convergence to a local minimum, and showing that your algorithm escapes saddle points, requires additional work and assumptions (even though it is often, if not always, the case in practice).
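To make the convex case concrete, here is a minimal sketch (the quadratic, its coefficients, and the step size are illustrative assumptions): GD on a smooth convex function drives the gradient to zero and, by convexity, that stationary point is the global minimizer.

```python
import numpy as np

# Smooth convex quadratic f(w) = 0.5 * w^T A w - b^T w,
# with A positive definite (eigenvalues strictly > 0).
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])

w = np.array([5.0, 5.0])                  # starting point w(0)
step = 1.0 / np.linalg.eigvalsh(A).max()  # 1/L, L = smoothness constant
for _ in range(500):
    grad = A @ w - b
    w = w - step * grad

# The unique global minimizer solves A w* = b; GD converges to it,
# and the gradient norm at the final iterate is ~0.
w_star = np.linalg.solve(A, b)
```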
3) To precisely compare the convergence rates of SGD and GD, you have to fix a class of functions. For example, on non-smooth convex functions both algorithms converge at the same rate, but on smooth and strongly convex functions (when the eigenvalues of the Hessian are bounded strictly away from zero), GD provably converges faster than SGD.
What is certain is that, in its most common version, one step of SGD is roughly N times cheaper than computing the full gradient.
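A hedged sketch of that comparison (problem sizes, noise level, and step-size schedules are my assumptions, not values from the course): on a smooth, strongly convex least-squares problem, GD with a fixed 1/L step converges linearly, while SGD with a decaying step size, given the same total number of data-point accesses, ends further from the minimizer because of gradient noise.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 200, 3
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -1.0, 2.0]) + 0.5 * rng.normal(size=N)  # noisy labels
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)  # exact minimizer

L = np.linalg.eigvalsh(X.T @ X / N).max()  # smoothness constant

# GD: 100 full-gradient steps (100 * N data-point accesses in total).
w_gd = np.zeros(d)
for _ in range(100):
    w_gd -= (1.0 / L) * (X.T @ (X @ w_gd - y) / N)

# SGD: 100 * N single-point steps (the same total data-access budget),
# with a decaying step size to damp the gradient noise.
w_sgd = np.zeros(d)
for t in range(1, 100 * N + 1):
    i = rng.integers(N)
    w_sgd -= (1.0 / (L * t**0.7)) * X[i] * (X[i] @ w_sgd - y[i])

# Under the same budget, GD ends much closer to w_star than SGD here.
```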
If you are interested in these questions, you should follow our next lecture on optimization for ML; we will cover all these topics.
Cheers,
Nicolas
Thank you very much for the answers.
I would really enjoy following the lecture on Optimization for ML, but it is very unlikely, as I am an exchange student at EPFL for only one semester. Do you maybe have some books to recommend on the topic?
Best regards,
Jean-Baptiste
Sure, you will find some materials (related courses and recommended books) on the course website (https://github.com/epfml/OptML_course). I particularly recommend Sébastien Bubeck's book.
Best,
Nicolas