Question on gaussian mixture models - soft clustering, role of pi

Hello, I have three questions that are all closely related to each other, since they may have very similar answers/implications. As such, there's no precise order to them.

1) In today's lecture we defined \(z_n\) as following a multinomial distribution with parameters \(\pi\), saying that this enables soft clustering instead of hard assignments. However, all the \(z_n\) follow the same distribution \(\pi\), which simply acts as a prior on the "assignment" to a certain Gaussian \(k\). In the formula for the joint distribution of a GMM, in fact, we have the likelihood of \(x_n\) given \(z_n\) times the probability of that \(z_n\) (I write out the formulas I have in mind right after this question). If we don't marginalise, we can maximise this joint distribution to find the best model parameters for given \(x\) and \(z\) (or treat \(z\) as a model parameter and optimise for it as well; I don't think it matters here), which then means making a hard assignment. Where does the claim of soft clustering come from? I can't understand it; it looks like a hard assignment with a prior.
Relating to question 2: if we consider \(\pi\) as a model parameter instead of a prior, the assignment of the \(z\) still appears hard to me; it is simply conditioned on an "optimisable" distribution instead of a fixed prior.
Relating to question 3: for a soft assignment to be produced we don't specifically need a separate distribution for each \(z_n\), as I suggest in question 3 for generalisation purposes. The marginalisation that we do does in fact produce soft clustering, from my understanding. Is this correct?
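For reference, these are the formulas I have in mind (using mixture weights \(\pi_k\), means \(\mu_k\) and covariances \(\Sigma_k\); I'm assuming this matches the notation of the slides):
\[
p(x_n, z_n = k) = \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k),
\qquad
p(x_n) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k).
\]
Maximising the expression on the left jointly over the \(z_n\) gives a hard assignment; the soft behaviour I'm asking about seems to come only from the marginal on the right.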

2) Since \(\pi\) acts as a prior, we have it in the aforementioned joint distribution, and we also have it in the marginalised version. However, when we optimise that cost function (written out below), we consider \(\pi\) as part of the model parameters, and thus we also optimise for it. Why?
I mean, I am totally fine with it, because it produces soft clustering (in my understanding), but what bothers me is that we are treating it both as a prior and as a parameter.
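Concretely, the cost function I'm referring to is (as I understand it) the marginal log-likelihood, in which \(\pi\) sits on the same footing as the other parameters:
\[
\max_{\pi, \mu, \Sigma} \; \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k).
\]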

3) Setting aside the marginalisation, why don't we use \(\pi\) purely as a prior (and not as a model parameter at all), and instead define each of the \(z_n\) as a separate multinomial distribution and treat those as model parameters (as they are in K-means)? In this way we would have a more powerful and general model (but of course more difficult to optimise and with more parameters), which is also "soft" even without the marginalisation, and which also has a nice prior to play with. When we saw K-means I thought that this would be the most natural generalisation of it.
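To make the proposal concrete, I'm picturing something like giving each point its own assignment distribution \(r_n\) (hypothetical notation of mine, not from the slides):
\[
p(z_n = k) = r_{nk}, \qquad r_{nk} \ge 0, \quad \sum_{k=1}^{K} r_{nk} = 1,
\]
where the \(r_n\) would be optimised as model parameters alongside the \(\mu_k\) and \(\Sigma_k\), while \(\pi\) would stay as a fixed prior over the assignments.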

Thank you, I hope the questions are clear!

Top comment

It seems you have a problem with the meaning of "prior". A prior (in Bayesian statistics) is a distribution that encodes our belief about some parameter (one that defines a model of how a given system should behave) before seeing the data. In this sense \(\pi\) is not a prior but just a distribution that basically says how large the clusters are.
As for why it is called a soft assignment: for a given data point we don't know exactly which cluster it is in; rather, we model the probability that it is in each given cluster. (From a quantum-mechanics point of view, we would say that the data point is in all the clusters until we make an observation, and then it has to choose one cluster; most probably this would be the cluster with the biggest probability of containing the data point in question.)
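To make that concrete, the soft assignment is the standard GMM posterior over the cluster index:
\[
p(z_n = k \mid x_n) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)},
\]
which is a full distribution over the clusters for each point rather than a single hard label.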

Thank you for the quick answer!

I'm saying that \(\pi\) is a prior because on page 3 of slides 11a it is specifically written that it is a prior, and we also said so during the lecture. At the end of the day we are in fact optimising for it as a model parameter, and when we marginalise it does produce soft clustering, so it's not a big deal; still, I was expecting a more general model, like the one I propose in question 3 of my original post.

Regarding the second part of the answer, what you say is clear, but I don't think that's what happens in the model, since in the posterior we treat \(z\) as a known (latent) variable: we neither optimise for its distribution (which is question 1 of my original post) nor consider each \(z_n\) as a different distribution (which is question 3).
In K-means we can say that we have a uniform distribution as \(\pi\), but that doesn't make K-means soft.

Again, when we marginalise, the soft assignment is clear and evident. My point is that I was expecting it to be in the main GMM formula, and not to appear only after the marginalisation! Thus I was expecting the model to be more general.

I'm thinking about this as I write, so my writing may come across as quite unclear; I'm sorry for that!
But TL;DR:
In K-means: we have a prior, which is uniform, and we don't optimise for it; we optimise for the values of \(z\) according to the uniform prior and in a hard way.
In GMMs: we have a different (more general) prior, and we even optimise for it, but this doesn't change the fact that we optimise for \(z\) according to the prior and in a hard way.
After marginalisation, again, yes, I see the soft clustering (even though I expected it to be more general), because \(z\) is no longer there and is treated only through its distribution \(\pi\).
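To put the contrast in formulas (as I understand it): in K-means the assignment step is the hard
\[
z_n = \arg\min_k \|x_n - \mu_k\|^2,
\]
whereas after marginalisation in the GMM, \(z_n\) only enters through its posterior \(p(z_n = k \mid x_n)\), and that is where the softness shows up.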

I don't quite understand what you are trying to say. I would advise you to wait for the next lecture, where you will see how to algorithmically solve the maximum likelihood problem posed by GMMs.

GMMs are indeed a generalization of K-means: you re-obtain K-means if you fix the covariance matrices to a shared spherical form \(\epsilon I\) and let \(\epsilon \to 0\). This makes the clusters spherical; with general GMMs the clusters can have an elliptical shape.
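To make that limit concrete, here is a minimal numerical sketch (my own illustration, not code from the course) showing how the soft responsibilities of a spherical GMM collapse to hard, K-means-style assignments as the shared variance shrinks:

```python
import numpy as np

def responsibilities(X, pi, mu, sigma2):
    """Posterior p(z_n = k | x_n) for a GMM with spherical covariances sigma2 * I.

    X:      (N, D) data points
    pi:     (K,)   mixture weights
    mu:     (K, D) component means
    sigma2: scalar shared spherical variance
    """
    # Squared distances from every point to every mean: shape (N, K).
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    # Unnormalised log-posteriors; the Gaussian normalisation constant is the same
    # for every component (shared sigma2), so it cancels when normalising over k.
    log_r = np.log(pi)[None, :] - d2 / (2.0 * sigma2)
    log_r -= log_r.max(axis=1, keepdims=True)   # numerical stability
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)

X = np.array([[0.4, 0.0], [2.0, 2.0]])          # two toy points
mu = np.array([[0.0, 0.0], [3.0, 3.0]])         # two cluster centres
pi = np.array([0.5, 0.5])

for sigma2 in [1.0, 0.1, 0.001]:
    print(sigma2, responsibilities(X, pi, mu, sigma2).round(3))
# As sigma2 -> 0 each row tends to a one-hot vector: the soft assignments
# collapse to the hard nearest-centre assignments of K-means.
```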

In addition, it makes sense to deviate a little from K-means, as it is very costly; I suspect that what you are proposing (even though I can't say I understand it clearly) would have the same problems, maybe even more.

As for the discussion about the prior: it was called a prior in the lecture, but you should not understand it as in Bayesian statistics (a prior over a parameter).

Don't see more in things than what they are: GMMs are defined this way; it is just a definition, and it is only as important as what it lets us do.
