These are chat archives for thunder-project/thunder

8th
Aug 2015
Jason Wittenbach
@jwittenbach
Aug 08 2015 13:21
@neuromusic I think your problem is with this line:
y = sc.parallelize(ts.rdd.take(2))
take(2) does not take the second vector, but instead grabs the first two records
also, you’re grabbing records — which include both keys and values — and not just the vectors in the values, which is probably what you want
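for illustration, what that call actually hands back looks something like this (keys and values here are hypothetical):
ts.rdd.take(2)
# -> [((0, 0), array([...])), ((0, 1), array([...]))]
# i.e. a list of the first two (key, value) records, so parallelizing it
# gives an RDD of two tuples rather than the samples of one time series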
Jason Wittenbach
@jwittenbach
Aug 08 2015 13:32
you might want to try something more like this:
v1, v2 = zip(*ts.take(2))[1]
x, y = sc.parallelize(v1), sc.parallelize(v2)
Statistics.corr(x, y, 'pearson')
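for reference, a self-contained version of that snippet (assuming a pyspark SparkContext called sc, the Series ts from above, and Python 2 zip semantics):
from pyspark.mllib.stat import Statistics

# grab the first two (key, value) records and keep just the value vectors
v1, v2 = zip(*ts.take(2))[1]

# distribute each vector as an RDD of floats and correlate them
x, y = sc.parallelize(v1), sc.parallelize(v2)
r = Statistics.corr(x, y, 'pearson')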
Jeremy Freeman
@freeman-lab
Aug 08 2015 14:01
yup, @jwittenbach's solution will work
you could also grab two particular vectors using indexing notation, e.g. to get the two time series at pixel locations (0,0) and (0,1) you could do
v1 = ts[0,0]
v2 = ts[0,1]
note that this will be much, much faster if you’ve first run
ts.cache()
ts.count()
and that all-to-all version would also work if you did
c = Statistics.corr(ts.values())
c.shape
-> (100, 100)
but it’s 100x100 because it’s computing correlations among the columns
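to make the row-vs-column distinction concrete, here's a rough local equivalent (hypothetical, and only sensible if the collected values fit in memory):
import numpy as np

mat = np.array(ts.values().collect())   # one record per row, one time point per column
c_local = np.corrcoef(mat, rowvar=0)    # correlations among the columns, also (100, 100)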
Justin Kiggins
@neuromusic
Aug 08 2015 16:47
I see. Looks like I should refresh myself on the thunder basics :)
OK, second question:
This seems a bit convoluted to go from the ts RDD to the v1 & v2 numpy arrays, then turn each back into an RDD to run the correlation. Is there some way to index the RDD directly and pass the parallelized time series into Statistics.corr?
Jeremy Freeman
@freeman-lab
Aug 08 2015 16:53
definitely agree =) perhaps more to the point, if you’re just correlating two vectors, maybe just use numpy?
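e.g. something like this, assuming v1 and v2 are the two vectors grabbed above:
import numpy as np

r = np.corrcoef(v1, v2)[0, 1]   # Pearson correlation of the two local vectors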
Justin Kiggins
@neuromusic
Aug 08 2015 16:55
but I already know how to use numpy :P
Jeremy Freeman
@freeman-lab
Aug 08 2015 16:55
ha i know, i guess i’m asking (and maybe you just answered), are you trying to understand the API, or maybe wishing there was additional functionality?
because we could definitely add methods on the Series object to do things like correlate or cross-correlate among the different time series
there are already corr and crossCorr but those correlate all time series to a single target
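roughly, that single-target pattern looks like this in plain numpy (just a sketch with a hypothetical helper name, not the actual Thunder API):
import numpy as np

def corr_to_target(mat, target):
    # Pearson correlation of each row of mat (one time series per row) with a single target signal
    t = (target - target.mean()) / target.std()
    z = (mat - mat.mean(axis=1, keepdims=True)) / mat.std(axis=1, keepdims=True)
    return (z * t).mean(axis=1)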
Justin Kiggins
@neuromusic
Aug 08 2015 16:57
more trying to understand the API, where its limits are, where Spark's limits are, and wrap my head around when it makes sense to do things in Spark vs numpy
Jeremy Freeman
@freeman-lab
Aug 08 2015 16:58
gotcha, those are great questions
and we could do better to clarify
we’re in the middle of a new project that will make all these objects (Images, Series, etc) look and behave a lot more like numpy arrays (or numpy arrays with extra methods)
and at that point, at least for this stuff, the spark vs numpy question will just be a matter of speed
Justin Kiggins
@neuromusic
Aug 08 2015 17:00
after spending a lot of time w/ pandas, I've gotten used to being able to throw a pd.Series into anything that takes numpy arrays. my first intuition was to be able to do the same w/ a Thunder Series & spark functions
but tbh, I haven't pushed on Thunder enough to offer any more insight than that
Jeremy Freeman
@freeman-lab
Aug 08 2015 17:02
ah interesting, you mean, you felt you should be able to throw a Thunder object into any Spark function?
Justin Kiggins
@neuromusic
Aug 08 2015 17:03
yeah, exactly.
Jeremy Freeman
@freeman-lab
Aug 08 2015 17:03
so right now that’s generally true if you call Series.values()
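for example (a minimal sketch; the map just computes the mean of each series):
rdd = ts.values()                     # a plain Spark RDD of the value vectors
rdd.map(lambda v: v.mean()).take(3)   # any regular RDD transformation applies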
Justin Kiggins
@neuromusic
Aug 08 2015 17:03
(not saying that's the way it should be, just that that was my first intuition)
ok
Jeremy Freeman
@freeman-lab
Aug 08 2015 17:04
yeah, definitely great feedback
intuition is valuable =)
Jeremy Freeman
@freeman-lab
Aug 08 2015 17:14
one other thing… if you want to play around with the neurofinder data and want to ignore Spark / Thunder entirely, you can call tslocal = ts.collectValuesAsArray()
that just gives you a local numpy array
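e.g. (a local sketch, assuming the collected array fits in memory):
import numpy as np

tslocal = ts.collectValuesAsArray()   # local numpy array, one time series per row
c = np.corrcoef(tslocal)              # all-to-all correlations among the time series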
Justin Kiggins
@neuromusic
Aug 08 2015 17:16
good to know. but then I won't learn Thunder :)
Jeremy Freeman
@freeman-lab
Aug 08 2015 17:17
ha, gp, wouldn't want that =)