Lab 4 - LSI

Hello,

We are having some trouble with the "svds" function in part 2 of Lab 4. It seems to be "semi-deterministic" in the following sense: when we run the algorithm multiple times, we get different solutions, and it seems to alternate randomly between 2 of them.

One solution seems more or less meaningful, while the other is rather strange: the 10 largest components of the left singular vector corresponding to the largest singular value are all identical and very small (around -1e-5).

Is there anything in particular to be aware of when using this function or handling its results? We looked at the documentation and the source code of the function, and it seems that the vectors and singular values are always ordered. So we think the ordering is not the problem, but rather something deeper, like the solver or a random seed that needs to be set somewhere.

Thank you for your time and advice!

Cyrille Pittet

I assume you checked that the singular values are ordered correctly (the documentation mentions this is not guaranteed). Are the singular values also different, or just the singular vectors? I would suggest that you check that there are no unexpected values in the TF-IDF matrix. Normalising the document vectors to have unit norm is also helpful. It could also be due to the randomness in the underlying solver of the svds function, in which case you could try alternative solvers (like the LOBPCG solver instead of the default ARPACK) and compare solutions. But do make sure there are no other, more obvious mistakes first.
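
To make these checks concrete, here is a minimal sketch. It assumes the TF-IDF matrix is a SciPy sparse matrix with terms as rows and documents as columns; the random matrix and the helper names `normalise_documents` and `sorted_svds` are hypothetical stand-ins, not part of the lab code.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import svds


def normalise_documents(X):
    """Scale every column (assumed to be one document) to unit Euclidean norm."""
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=0)).ravel())
    norms[norms == 0] = 1.0  # keep empty documents as zero columns
    return X @ diags(1.0 / norms)


def sorted_svds(X, k):
    """Run svds and return the factors ordered by decreasing singular value."""
    u, s, vt = svds(X, k=k)
    order = np.argsort(s)[::-1]
    return u[:, order], s[order], vt[order, :]


# Hypothetical stand-in for the TF-IDF matrix (terms x documents).
rng = np.random.default_rng(2)
dense = rng.random((50, 20)) * (rng.random((50, 20)) < 0.3)
tfidf = csr_matrix(dense)

# Sanity checks for unexpected values (non-negativity is an assumption about TF-IDF weights).
assert np.isfinite(tfidf.data).all(), "TF-IDF contains NaN or inf"
assert (tfidf.data >= 0).all(), "TF-IDF contains negative weights"

tfidf_unit = normalise_documents(tfidf)
u, s, vt = sorted_svds(tfidf_unit, k=3)
print(s)
```

Sorting explicitly costs almost nothing and removes any doubt about the order in which the singular values are returned.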

The singular values are always the same. It is only the singular vectors that are changing.

We will look into our TF-IDF matrix then.

Thank you!

I work on the project with Cyrille and we have the exact same code. When we run only the svds function multiple times (so without recomputing the TF-IDF matrix), we obtain, as Cyrille said, different results in a different order. (We also checked how we compute the TF-IDF matrix, and we think it is done correctly.)

It seems that the computation oscillates between 3 different solutions, always with the same singular values but different singular vectors (maybe a numerical instability in the svds function?).

Since the difference comes from the svds output, we don't really see what we can do to overcome this problem. Is it okay to run the function multiple times until we obtain a meaningful result that lets us answer all the questions in the handout?

Furthermore, we looked into the source code of the svds function, and it does a kind of sorting at the end, so I suppose we don't have to sort the singular values ourselves.

Thank you for your time!

Top comment

If you checked the TF-IDF matrix and also normalised the document vectors, then this is probably due to the randomness of the solver. You could try alternative solvers (like the LOBPCG solver instead of the default ARPACK) and compare solutions, as mentioned in my earlier answer. You can pass arguments to svds to choose the solver.
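
As a rough sketch of what this could look like (assuming a SciPy version whose `svds` accepts the `solver` and `v0` arguments, and using a random sparse matrix as a hypothetical stand-in for the TF-IDF matrix): fixing the ARPACK starting vector `v0` is one way to make repeated runs start from the same point, and the singular values can then be compared against the LOBPCG solver.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Hypothetical stand-in for the TF-IDF matrix (terms x documents).
rng = np.random.default_rng(0)
dense = rng.random((200, 80)) * (rng.random((200, 80)) < 0.1)
tfidf = csr_matrix(dense)

k = 5

# By default the ARPACK starting vector is chosen randomly.
# Passing a fixed v0 of length min(M, N) makes repeated runs start identically.
v0 = np.ones(min(tfidf.shape))
u_a, s_a, vt_a = svds(tfidf, k=k, v0=v0)
u_b, s_b, vt_b = svds(tfidf, k=k, v0=v0)
print("same vectors across runs:", np.allclose(u_a, u_b))

# Alternative solver for comparison. Only the singular values are compared,
# since the singular vectors may still differ by a sign.
_, s_lobpcg, _ = svds(tfidf, k=k, solver="lobpcg")
print("ARPACK:", np.sort(s_a)[::-1])
print("LOBPCG:", np.sort(s_lobpcg)[::-1])
```

If both solvers agree on the singular values and the fixed starting vector gives stable vectors, the variation you see is most likely the solver's random initialisation and not a mistake in your matrix.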

Note also that the SVD is unique only up to a sign change in the singular vectors (i.e., it is normal for the components of the singular vectors to have different signs at each run, although the magnitudes should be the same).
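
One way to make runs comparable despite the sign ambiguity is to fix a sign convention yourself. The sketch below is only an illustration: the helper `fix_signs` and the convention it uses (largest-magnitude entry of each left singular vector made positive) are my own choices, and it uses a small dense matrix with `np.linalg.svd` to keep the example short.

```python
import numpy as np


def fix_signs(u, vt):
    """Flip each (u_i, v_i) pair so the largest-magnitude entry of u_i is positive."""
    cols = np.arange(u.shape[1])
    pivot = np.argmax(np.abs(u), axis=0)  # row of the largest |entry| in each column
    signs = np.sign(u[pivot, cols])
    signs[signs == 0] = 1.0  # leave all-zero columns untouched
    # Flipping u_i and v_i together keeps the product U S V^T unchanged.
    return u * signs, vt * signs[:, None]


# Hypothetical example: the same decomposition with two singular vector pairs sign-flipped.
rng = np.random.default_rng(1)
A = rng.random((30, 12))
u, s, vt = np.linalg.svd(A, full_matrices=False)
flip = np.ones(len(s))
flip[[0, 3]] = -1.0
u_flipped, vt_flipped = u * flip, vt * flip[:, None]

u1, vt1 = fix_signs(u, vt)
u2, vt2 = fix_signs(u_flipped, vt_flipped)
print("vectors agree after fixing signs:", np.allclose(u1, u2) and np.allclose(vt1, vt2))
print("reconstruction unchanged:", np.allclose((u1 * s) @ vt1, A))
```

The same function can be applied to the output of svds from two different runs before comparing the singular vectors.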
