
L1 regularization for feature selection vs. L2

Hi,
Why is it a "better" / "good" idea to use L1 regularization as a feature selector rather than L2? This was discussed during today's Q&A, but I'm not sure I understand why.

Thanks in advance for your answer.

In the picture below:

  • [right] The L1 penalty's contours (a diamond) have sharp corners that sit on the axes, which makes it likely that the optimum of the regularized loss lies on an axis. Lying on an axis means a sparse weight vector (some weights are exactly 0), which is what gives the feature-selection effect (a minimal Lasso vs. Ridge sketch after the links below illustrates this).
  • [left] L2 regularization only puts the weight vector on an axis when the (non-regularized) optimum is on one of the axes itself, which rarely happens compared to all the other positions in weight space. Note that L2 regularization, while having less of a feature-selection effect, is more stable when features are correlated (see the optional reading).

[Image: contour plots of L2 (left) and L1 (right) regularization]

image credit + graphical intuition: https://www.youtube.com/watch?v=sO4ZirJh9ds

Optional reading: https://blog.datadive.net/selecting-good-features-part-ii-linear-models-and-regularization/
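
To make this concrete, here is a minimal sketch (assuming scikit-learn and a synthetic dataset of my own choosing, not anything from the lecture) comparing the coefficients learned with an L1 penalty (Lasso) and an L2 penalty (Ridge) when only a few features are informative; the alpha values are arbitrary illustration choices, not tuned:

```python
# Compare L1 (Lasso) vs. L2 (Ridge) coefficients on synthetic data where
# only the first 3 of 10 features actually influence the target.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[:3] = [2.0, -1.5, 1.0]              # only the first 3 features are informative
y = X @ true_w + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)         # L1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)        # L2 penalty

print("L1 (Lasso) coefficients:", np.round(lasso.coef_, 3))
print("L2 (Ridge) coefficients:", np.round(ridge.coef_, 3))
# Typically the Lasso sets the 7 uninformative coefficients to exactly 0
# (feature selection), while Ridge only shrinks them toward 0.
```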

Thank you for this! It is very clear :))

I have one extra question:
If we transform our data with PCA (so the data is decorrelated), then both L1 and L2 are stable, correct?
And could using L1 then be a way to choose how many of the first components to keep?
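
In case it helps, here is a small sketch of what that idea could look like (purely illustrative; the synthetic data, alpha, and component counts are my own assumptions, not a claim that this is the recommended way to pick the number of components): fit PCA, then a Lasso on the decorrelated components, and count the nonzero coefficients.

```python
# PCA to decorrelate, then an L1 penalty on the components; the nonzero
# Lasso coefficients indicate which components it keeps.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, d = 300, 20
latent = rng.normal(size=(n, 3))                       # 3 strong underlying directions
loadings = rng.normal(size=(3, d))
X = latent @ loadings + 0.1 * rng.normal(size=(n, d))  # most variance lives in 3 directions
y = latent[:, 0] - 2.0 * latent[:, 1] + 0.1 * rng.normal(size=n)

Z = PCA(n_components=d).fit_transform(X)               # decorrelated, ordered by variance
lasso = Lasso(alpha=0.05).fit(Z, y)

print("components with nonzero weight:", np.flatnonzero(lasso.coef_))
# With this setup the nonzero coefficients tend to fall on the first few
# (high-variance) components, but note that the L1 penalty keeps the
# components that predict y, which need not all be the leading ones.
```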
