
optimal learning rate

Dear TAs,

We are a bit stuck: we know we need to implement gradient descent for our project, but we don't know how to choose a learning rate that is large enough to converge quickly and find a minimum, yet not so large that it oscillates around a minimum, eventually escapes it, and lands on an asymptote where the gradient is close to 0.

What is done in practice?
thanks

Hi,

Something I would do is choose the largest initial learning rate that does not result in divergence. Then, after training for a few epochs, once the training loss (or, even better, the validation accuracy) stops improving, I would multiply the learning rate by 0.1 and train until the validation accuracy stops improving again. If you perform 3-5 rounds of training and reducing the learning rate this way, you'll usually end up with a good enough solution.
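
In case it helps, here is a rough sketch of that loop in Python; train_one_epoch and validation_accuracy are hypothetical placeholders for your own training and evaluation code, and the numbers are only illustrative:

```python
import numpy as np

# Sketch of "drop the learning rate when validation stops improving".
# train_one_epoch() and validation_accuracy() are hypothetical placeholders.

def fit(w, lr=1.0, drop_factor=0.1, n_drops=4, patience=3, max_epochs=200):
    # lr should be the largest initial learning rate that does not diverge
    best_acc, epochs_without_improvement, drops_done = -np.inf, 0, 0
    for epoch in range(max_epochs):
        w = train_one_epoch(w, lr)       # one pass of gradient descent at the current lr
        acc = validation_accuracy(w)
        if acc > best_acc:
            best_acc, epochs_without_improvement = acc, 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:   # validation accuracy has plateaued
            if drops_done == n_drops:
                break                                 # 3-5 drops is usually enough
            lr *= drop_factor                         # multiply the learning rate by 0.1
            drops_done, epochs_without_improvement = drops_done + 1, 0
    return w
```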

Thank you,

Just to clarify, by asymptote I meant horizontal asymptotes (our loss is a sum of squared errors over sigmoid outputs, so it behaves like a sum of sigmoids and is therefore non-convex) where the gradient is 0. So taking a step of gamma * gradient results in barely any update of the weights. A large gamma may not make the loss diverge, but it risks pushing the point onto one of these asymptotes and getting stuck there, when the minimum may be somewhere closer to the origin (x = 0). On the other hand, a small gamma may be too small to move the point at all if it starts off on one of these shallow asymptotes.
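
For instance, here is a tiny made-up 1-D example of what I mean (the numbers are arbitrary):

```python
import numpy as np

# Made-up 1-D illustration of the plateau problem: a least-squares loss on
# sigmoid outputs is flat far from the origin, so gamma * gradient barely
# moves the weight there.

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, -1.5])   # toy inputs (arbitrary)
y = np.array([0.0, 1.0, 1.0])    # toy targets (arbitrary)

def grad(w):
    p = sigmoid(x * w)
    # d/dw sum_i (p_i - y_i)^2 = sum_i 2 (p_i - y_i) * p_i (1 - p_i) * x_i
    return np.sum(2 * (p - y) * p * (1 - p) * x)

gamma = 0.5
for w in [0.5, 10.0]:
    print(f"w = {w:5.1f}  gradient = {grad(w): .2e}  update = {gamma * grad(w): .2e}")
# Near w = 0.5 the update is noticeable; at w = 10 the sigmoids are saturated,
# the gradient is ~0, and the point is effectively stuck on the plateau.
```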

Thank you for your help


Similarly to what Mahdi explained (trying out a few learning rates and multiplying by a factor each time), you can automate this further by sweeping the learning rate over a single epoch, which is especially useful if one epoch already takes a considerable amount of time:
https://sgugger.github.io/how-do-you-find-a-good-learning-rate.html
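
Roughly, the idea from that post looks like this (train_step is a hypothetical placeholder for one mini-batch gradient step; the bounds are just examples):

```python
import numpy as np

# Sketch of a learning-rate sweep over one epoch: increase the learning rate
# exponentially after every mini-batch and record the loss. train_step() is a
# hypothetical placeholder that does one update and returns (weights, loss).

def lr_sweep(w, batches, lr_min=1e-6, lr_max=10.0):
    n = len(batches)
    lrs = lr_min * (lr_max / lr_min) ** (np.arange(n) / max(n - 1, 1))  # exponential ramp
    losses = []
    for lr, batch in zip(lrs, batches):
        w, loss = train_step(w, batch, lr)
        losses.append(loss)
        if loss > 4 * min(losses):   # stop once the loss clearly diverges
            break
    return lrs[:len(losses)], np.array(losses)

# You would then plot losses against lrs (log scale) and pick a learning rate
# a bit below the point where the loss starts shooting up.
```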

"A large gamma may not make the loss diverge, but it risks pushing the point onto one of these asymptotes and getting stuck there, when the minimum may be somewhere closer to the origin (x = 0). On the other hand, a small gamma may be too small to move the point at all if it starts off on one of these shallow asymptotes."

If you encounter many local minima and/or saddle points, a cyclical learning rate can help (not the only possible solution):
https://arxiv.org/abs/1506.01186
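
A minimal sketch of the triangular schedule from that paper (the bounds and step size are just example values):

```python
import numpy as np

# Triangular cyclical learning rate in the spirit of arXiv:1506.01186.
# base_lr, max_lr and step_size are illustrative, not recommendations.

def triangular_lr(iteration, base_lr=1e-3, max_lr=1e-1, step_size=500):
    # step_size = number of iterations to go from base_lr up to max_lr
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# Example: lr starts at base_lr, rises linearly to max_lr at iteration 500,
# falls back to base_lr at iteration 1000, and then the cycle repeats.
```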

Learning rate annealing might also help you avoid poor local minima early on, while still converging to (usually better) minima later without jumping over them.
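
For example, one common annealing schedule is cosine decay; a minimal sketch (the values are just examples, and this is not the only schedule you could use):

```python
import math

# Cosine annealing: decay the learning rate smoothly from lr_max to lr_min
# over the course of training.

def cosine_annealed_lr(epoch, total_epochs, lr_max=1e-1, lr_min=1e-4):
    t = epoch / max(1, total_epochs - 1)   # training progress in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

# Early on the learning rate is large (helps escape poor local minima and
# plateaus); towards the end it shrinks smoothly, so you don't jump over
# the better minima you eventually reach.
```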


