Questions about the lecture

Hello,
I have two questions regarding the lectures:
The first one is about parameter estimation of a power law distribution from a log-log plot. I understand, how to find the \( \gamma \) from both the CCDF and the PDF. But, is it possible to obtain \( \gamma \) from the log-log plot of the CDF of the distribution ?
About the other parameter \( \beta \), is it possible to estimate from the log-log plot ? And if yes how ?

And my second questions is about Gibbs Sampling: I'm not sure about the definition for the equivalence relation \( z\sim_k z'\). In the slide, it is written that \( z\sim_k z' \) if \( z' \) is equal to \( z \) except at position \( k \). Does it mean that \( z \) and \( z' \) must be strictly different at position \( k \) or can they be equal at position \( k \) ?

Thanks a lot for your help ,
Alessio

Top comment

Hello, thank you for all your questions and feedback! For overall readability and usability, could I ask you to post topically new questions as a separate new thread?

Regarding the power law: just to be clear, the \(\beta\) parameter is specific to the Pareto distribution, which is special in that it is an exact power law over its entire domain. With real data, is it much more common to see something like the curve labeled "asymptotic power law" in the plot you're citing, where we just identify a heavy tail that has (more or less) a constant slope in the log-lop plot.

Yes, the label $1$ refers to the "original" units, not the units after taking the log. The range of the y-axis after taking the log goes from 0 to \(-\infty\).

I would add two more questions:

  1. On slide 26 of Lecture 2, it is written that if \(c_u>>p\) the network has high clustering. This means that if we want to know if the network is highly clustered for a given network, we need to approximate \( p\) with \(\frac{2m}{n^2}\) (can remove the expectation since the network is known), is that correct ?

  2. I have a question about the slide 38 of lecture 2. I'm not sure to see why the clustering coefficient is goes to 0 as the ratio \(\frac kn\) is increasing. Could you explain this point a bit further ?

Finally, on the slide about LDA, you defined the Dirichlet distribution but on Wikipedia and in other papers, the factor

$$ \frac{\prod_{i=1}}^K \Gamma (\alpha_i)}{\Gamma (\sum_{i=1}^K \alpha_i)} $$

(in the slides) is the inverse i.e.

$$ \frac {\Gamma (\sum_{i=1}^K \alpha_i)}{\prod_{i=1}}^K \Gamma (\alpha_i)} $$

Which formula is the correct one?

Thanks a lot !

Power laws: The CDF is increasing (well, non-decreasing) with limit = 1 for large \(x\), whereas the CCDF is non-increasing with limit 0. So taking the log of the CDF implies a limit of 0, ie, the curve "flattens out" and cannot reveal the tail behavior. Only the log-log plot of the PDF or CCDF allows one to find \(\gamma\) via simple inspection of the plot.

The parameter \(\beta\) is simply the left boundary of the support of the Pareto distribution. In other words, to the left of \(\beta\), the PDF and CDF are zero.

Gibbs Sampling: no, \(z \sim_k z'\) does not imply that \(z\) and \(z'\) are different at position \(k\).

Clustering:

  1. Yes, that is correct.

  2. This plot is somewhat handwavy, and the clustering coefficient does not actually go to zero (the graph formed by only the shortcut edges is almost a \(G(n,p)\) random graph, which, as you know, has expected clustering coefficient \(p>0\). The point is that when the # of shortcuts becomes larger, the random graph adds many connected triples, most of which are not triangles -> the clustering coefficient decreases.

Of course, strictly speaking, for \(k \rightarrow \infty \), we'd end up with a complete graph, so "far to the right", the clustering coefficient has a limit of one.

LDA and Dirichlet distribution: Wikipedia is right (of course ;), and I have the inverse (which is obviously too small given that the Gamma function grows super-exponentially). Thank you, good catch! Will update the slides. (I should reassure you that there would not be an exam question involving the normalizing constant, given that you don't have calculators).

Btw, apologies that some of these answers are coming a bit late, I had overlooked the status email from oknoname. Do feel free to drop me an email when something doesn't get answered within a day or two, just in case.

@matthias said:
Power laws: The CDF is increasing (well, non-decreasing) with limit = 1 for large \(x\), whereas the CCDF is non-increasing with limit 0. So taking the log of the CDF implies a limit of 0, ie, the curve "flattens out" and cannot reveal the tail behavior. Only the log-log plot of the PDF or CCDF allows one to find \(\gamma\) via simple inspection of the plot.

The parameter \(\beta\) is simply the left boundary of the support of the Pareto distribution. In other words, to the left of \(\beta\), the PDF and CDF are zero.

Screenshot 2021-06-15 at 19.12.29.jpg

I'm not sure I understand for the \(\beta\) parameter. If we look at the above plot, in my opinion, the \(\beta\) parameter will be the moment when we start to have a slope, is that correct ?

Furthermore, in this plot, it is written that the CCDF has maximal value 1. But since, we are in log-log plot, why is this value 1 and not \(\log(1) =0\) ?

Also, I would have one more question about LDA. I understand that the goal is to extract a number of topics from a corpus of document without knowing the topics in advance. LDA will "produce" the \(\alpha \) vectors for each document (distribution of topic into this document) and the \(\beta \) vectors for each topic (distribution of word in the topic). Now, my questions are:

  • Since the LDA model is a probabilistic model, can we use it to generate new document ? If yes, how can we proceed ?
  • Can we use LDA directly to predict the topic inside a new document ? Or this isn't the goal of the model and we should use some alternatives processing /modelling ?

Finally, in the homework 6, ex 2 part b. We have to compare the following two probabilities \(P (G = 1|H = 1) > P (G = 1|H = 1, B = 1)\). This problem can be resolved numerically, using the probability in the previous point. But can we solve it intuitively ? I would say yes, as if we know that \(H=1\) this increases the chance of \(C=1\) and therefore the chance of \(G=1\). Now, if we know that \(B=1\) then \(C=1\) is less likely as we have already a way to explain why \(H=1\) (with B). therefore the second probability is smaller than the first one. Is that a correct reasoning ?

I think it is correct only if the variables are "positively correlated" i.e. if B positive, increase the chance of H to be positive for instance. Is it possible to have, in Bayesian Networks, variables that are "negatively correlated", for instance knowing B positive increase the chance of H to be negative ? If so, we cannot use the "intuitive" way of comparing probability right ?

Thanks a lot for all your other answers, it makes more sense now !

Top comment

Hello, thank you for all your questions and feedback! For overall readability and usability, could I ask you to post topically new questions as a separate new thread?

Regarding the power law: just to be clear, the \(\beta\) parameter is specific to the Pareto distribution, which is special in that it is an exact power law over its entire domain. With real data, is it much more common to see something like the curve labeled "asymptotic power law" in the plot you're citing, where we just identify a heavy tail that has (more or less) a constant slope in the log-lop plot.

Yes, the label $1$ refers to the "original" units, not the units after taking the log. The range of the y-axis after taking the log goes from 0 to \(-\infty\).

Add comment

Post as Anonymous Dont send out notification