
Why standardise?

Hello.

Quick and silly question: why do we standardise data? I agree that it doesn't hurt, but how does it help?

Thanks.
Best.

Elia

One important reason is that we want the features to all be on similar scales. Otherwise, features that happen to have a larger variance, simply because of the units they are measured in, will disproportionately influence the fit.

Agreed. Maybe to add a bit: another point of view is to see it from the convergence analysis of gradient descent. There was a short part about it in the lecture:
Feature normalization and preconditioning: gradient descent is very sensitive to ill-conditioning. Therefore, it is typically advised to normalize your input features; in other words, we precondition the optimization problem. Without this, step-size selection is more difficult, since different "directions" might converge at different speeds.

A few more details about the condition number can be found, e.g., here (page 4), where the message is that the convergence rate depends directly on the condition number. For the least-squares problem, it is the ratio between the largest and smallest eigenvalues of the matrix X^T X (i.e., the Hessian). Standardization is then useful because it can reduce the condition number by rescaling the columns of X.
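To make this concrete, here is a small numerical sketch (the data and scales are made up for illustration): with two features on very different scales, the condition number of X^T X is huge, and standardizing the columns brings it down to nearly 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Two hypothetical features, identical except for the units they are measured in.
X = np.column_stack([
    rng.normal(0.0, 1.0, n),      # feature in "natural" units
    rng.normal(0.0, 1000.0, n),   # same kind of feature, in much smaller units
])

def cond_number(X):
    """Condition number of the least-squares Hessian X^T X."""
    eig = np.linalg.eigvalsh(X.T @ X)   # eigenvalues in ascending order
    return eig[-1] / eig[0]

# Standardize: zero mean, unit variance per column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(cond_number(X))      # very large (roughly the squared ratio of the scales)
print(cond_number(X_std))  # close to 1
```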

I'm sorry, I can't see how a feature \(d\) measured in smaller units, and thus having a larger variance, will influence any weight other than \(w_d\).
I mean, say we have \(N\) zero-mean (for simplicity) data points with arbitrary variance, and the corresponding optimal weight vector \(w^*\): if we multiply each point's \(d\)-th feature by \(k\), all that happens is that \(w^*_d\) gets multiplied by \(\frac{1}{k}\).
That's why I don't see how different units are a problem: the weights also have units, so any scaling we apply to the points just results in the inverse scaling being applied to the optimal weights.
In this sense I don't understand where the problem is with different features converging at different speeds: they do converge at the same speed, in units of standard deviations (I think).

I have this horrible feeling where I know I'm wrong, but I don't know how.
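Just to check my own claim numerically, a quick sketch (synthetic data, assuming an ordinary least-squares setting):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Optimal least-squares weights on the original data.
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)

k = 50.0
X_scaled = X.copy()
X_scaled[:, 0] *= k               # feature 0 now expressed in smaller units
w_scaled, *_ = np.linalg.lstsq(X_scaled, y, rcond=None)

# The optimum just absorbs the scaling:
print(np.allclose(w_scaled[0], w_star[0] / k))   # True: weight picks up 1/k
print(np.allclose(w_scaled[1:], w_star[1:]))     # True: other weights unchanged
```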

You are totally right that we can obtain the optimal \(w^*_d\) of the rescaled problem from \(w^*\) just by scaling each weight by a constant. However, this doesn't imply that gradient descent will have the same behaviour on the two objectives (one with and one without normalization).
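To illustrate this, here is a minimal sketch (synthetic data; step size 1/L taken from the largest eigenvalue of the Hessian, a standard choice): the two objectives have optima related by a simple rescaling, yet gradient descent needs far more iterations on the ill-conditioned one to reach the same gradient tolerance.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 2))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=n)

def gd_iters(X, y, tol=1e-8, max_iter=100_000):
    """Iterations of gradient descent on (1/2n)||Xw - y||^2 with step 1/L."""
    H = X.T @ X / n
    L = np.linalg.eigvalsh(H)[-1]        # largest eigenvalue = smoothness constant
    w = np.zeros(X.shape[1])
    for t in range(1, max_iter + 1):
        grad = X.T @ (X @ w - y) / n
        if np.linalg.norm(grad) < tol:
            return t
        w -= grad / L
    return max_iter

# Same data, but feature 1 expressed in units 10x larger (values 10x smaller),
# which worsens the condition number of the Hessian by a factor of ~100.
X_bad = X.copy()
X_bad[:, 1] /= 10.0

print(gd_iters(X, y))       # converges in a handful of iterations
print(gd_iters(X_bad, y))   # needs many more iterations for the same tolerance
```

The minimizers of the two problems encode the same predictor, but gradient descent crawls along the poorly scaled direction in the second case, which is exactly the condition-number effect mentioned above.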

Maybe to get a bit more intuition about the condition number and how it affects the behaviour of gradient descent, you can also check out slides 19 - 26 here (a good example on a quadratic function and some visualizations): http://www.cs.toronto.edu/~rgrosse/courses/csc421_2019/slides/lec07.pdf.

I hope that helps.

