These are chat archives for thunder-project/thunder

3rd May 2015
asishgeek
@asishgeek
May 03 2015 03:58
@freeman-lab Hi Jeremy, I'm trying to use the PCA code in the thunder project, but I suspect the code isn't returning correct results, or I might be doing something wrong. E.g. if you take the simple array X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]), the PCA code in thunder returns the following principal components: array([[-0.83849224, -0.54491354], [ 0.54491354, -0.83849224]]), while the result from scikit-learn is array([[ 0.83849224, 0.54491354], [ 0.54491354, -0.83849224]]). Also, when I call transform on the same array and compute the variance along each PC, I get ([ 0.16666667, 0.16666667]), which is wrong because the variance along the first PC should be larger than along the second. I'm getting similar results for my actual example, i.e. the variance along each PC is the same. Do you know what might be going on here? Thanks a lot for putting together this project. It's awesome.
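For reference, here's a minimal standalone sketch of the scikit-learn side of the comparison above (the thunder calls and the RowMatrix setup are assumed rather than shown); the printed values are approximate.

```python
# Sketch of the scikit-learn side only; printed values are approximate.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

pca = PCA(n_components=2).fit(X)
print(pca.components_)                   # ~[[0.8385, 0.5449], [0.5449, -0.8385]] (signs may vary)
print(np.var(pca.transform(X), axis=0))  # ~[6.616, 0.0504]: variance along each PC
```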
Jeremy Freeman
@freeman-lab
May 03 2015 13:13
@asishgeek thanks for the feedback! regarding the principal components, PCA is only defined up to a sign flip on each component, and if you look closely, thunder's answer is equivalent to sklearn's up to a sign flip on the first component
some libraries enforce a convention to fix the sign one way or the other (e.g. making the largest element in each column positive); i'm not sure whether sklearn does, but matlab does
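As a small sketch of what such a convention could look like (hypothetical, not necessarily what sklearn or matlab actually does): flip each component so its largest-magnitude entry is positive, which makes outputs from different libraries directly comparable.

```python
# Illustrative only: one possible sign convention for comparing PCA results.
import numpy as np

def fix_signs(components):
    """Flip each row (one principal component per row) so its
    largest-magnitude entry is positive."""
    flipped = components.copy()
    for i, row in enumerate(flipped):
        if row[np.argmax(np.abs(row))] < 0:
            flipped[i] = -row
    return flipped

thunder_pcs = np.array([[-0.83849224, -0.54491354],
                        [ 0.54491354, -0.83849224]])
sklearn_pcs = np.array([[ 0.83849224,  0.54491354],
                        [ 0.54491354, -0.83849224]])
print(np.allclose(fix_signs(thunder_pcs), fix_signs(sklearn_pcs)))  # True
```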
regarding the variance explained on the transformed data, can you say more about how you're computing that? i'm confident the coefficients are identical to what sklearn does (up to the sign flip), but there may be differences in how explained variance is computed, and also how mean subtraction is handled during the transformation
i've been meaning to look into this more
asishgeek
@asishgeek
May 03 2015 14:29
@freeman-lab Thanks for the reply. I get your point about the sign flip. To compute the variance in sklearn I do np.var(pca.transform(X), axis=0), which gives me array([ 6.61628593, 0.05038073]), while in thunder I do X_t = pca.transform(X); X_t.variance(), which returns ([ 0.16666667, 0.16666667]). In the latter case X is a RowMatrix.
Richard A Hofer
@rhofour
May 03 2015 15:19
@freeman-lab I'm not sure where you want that line break to go
Jeremy Freeman
@freeman-lab
May 03 2015 15:32
@rhofour sorry, just added a clarification =)
Richard A Hofer
@rhofour
May 03 2015 15:33
Ah, didn't do that with my other example dataset. I'll update both of them.
Jeremy Freeman
@freeman-lab
May 03 2015 15:33
oh great, must've missed the first one
Richard A Hofer
@rhofour
May 03 2015 15:34
^- done
Jeremy Freeman
@freeman-lab
May 03 2015 15:43
@asishgeek :point_up: May 3 2015 10:29 AM great, that helps explain it! in PCA.transform there's a step normalizing each transformed variable by the corresponding latent value (here). if i remove that and rerun your example, i get exactly the same output as the sklearn code. that extra division was inherited from a similar calculation in the SVD, but it may not be appropriate here given how people usually use transforms in PCA, so i think we'll remove it!
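A rough numpy illustration of the behavior described above (not thunder's actual code; it assumes the latent values in question are the singular values of the centered data): a plain projection matches sklearn's transform, while the extra division makes every column's variance come out equal.

```python
# Projecting onto the components reproduces sklearn's scores; dividing each
# score by its singular value ("latent value") makes every column's variance 1/n.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Xc = X - X.mean(axis=0)

pca = PCA(n_components=2).fit(X)
scores = Xc.dot(pca.components_.T)       # plain projection, identical to pca.transform(X)
print(np.var(scores, axis=0))            # ~[6.616, 0.0504]

latent = np.linalg.norm(scores, axis=0)  # singular values of the centered data
print(np.var(scores / latent, axis=0))   # ~[0.1667, 0.1667], as in the report above
```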
Jeremy Freeman
@freeman-lab
May 03 2015 15:53
@rhofour thanks, merged!
Richard A Hofer
@rhofour
May 03 2015 16:00
:) Always happy to see stuff get merged upstream
Currently working on actually benchmarking our inverse
If we can make it competitive with other frameworks I'll send a huge PR
asishgeek
@asishgeek
May 03 2015 16:51
@freeman-lab :point_up: May 3 2015 11:43 AM Got it! Thanks a lot for the clarification.