@neuromusic I think your problem is with this line:

`y = sc.parallelize(ts.rdd.take(2))`

`take(2)` does not take the second vector; it grabs the first two records. also, you’re grabbing records (which include both keys and values), not just the vectors in the values, which is probably what you want

you might want to try something more like this:

```
v1, v2 = list(zip(*ts.take(2)))[1]  # (keys, values) -> keep just the two value vectors
x, y = sc.parallelize(v1), sc.parallelize(v2)
Statistics.corr(x, y, 'pearson')
```
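for reference, here’s what that `zip(*…)` trick is doing, on plain Python data — the `records` list below is made up to mimic what `ts.take(2)` returns (two (key, value) pairs, keys being pixel coordinates):

```python
# Hypothetical stand-in for ts.take(2): two (key, value) records
records = [((0, 0), [1.0, 2.0, 3.0]), ((0, 1), [2.0, 4.0, 6.0])]

# zip(*records) transposes the list of pairs into (keys, values);
# indexing with [1] keeps only the two value vectors
v1, v2 = list(zip(*records))[1]
print(v1)  # [1.0, 2.0, 3.0]
print(v2)  # [2.0, 4.0, 6.0]
```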

yup, @jwittenbach’s solution will work

you could also grab two particular vectors using indexing notation, e.g. to get the two time series at pixel locations (0,0) and (0,1) you could do

```
v1 = ts[0,0]
v2 = ts[0,1]
```

note that this will be much, much faster if you’ve first run

```
ts.cache()
ts.count()
```

and the all-to-all version would also work if you did

```
c = Statistics.corr(ts.values())
c.shape
-> (100, 100)
```

but it’s 100x100 because it’s computing correlations among the columns
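a quick numpy analogue of that column-wise behavior (the shapes below are made up to match the example; `np.corrcoef` with `rowvar=False` stands in for `Statistics.corr` on the stacked values):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((20, 100))  # 20 records, each a length-100 vector

# correlating among columns yields one entry per pair of columns,
# hence a 100x100 matrix regardless of how many records there are
c = np.corrcoef(data, rowvar=False)
print(c.shape)  # (100, 100)
```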

I see. Looks like I should refresh myself on the thunder basics :)

OK, second question:

This seems a bit convoluted: going from the `ts` RDD to the `v1` and `v2` numpy arrays, then turning each back into an RDD to run the correlation. Is there some way to index the RDD directly and pass the parallelized time series into `Statistics.corr`?
definitely agree =) perhaps more to the point, if you’re just correlating two vectors, maybe just use numpy?
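(for example, the two-vector case in plain numpy is a one-liner — data below is made up:)

```python
import numpy as np

v1 = np.array([1.0, 2.0, 3.0, 4.0])
v2 = np.array([1.1, 1.9, 3.2, 3.8])

# Pearson correlation of two vectors: off-diagonal entry of the 2x2 matrix
r = np.corrcoef(v1, v2)[0, 1]
```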

but I already know how to use numpy :P

ha i know, i guess i’m asking (and maybe you just answered), are you trying to understand the API, or maybe wishing there was additional functionality?

because we could definitely add methods on the `Series` object to do things like correlate or cross-correlate among the different time series

there are already `corr` and `crossCorr`, but those correlate all time series to a single target
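a numpy sketch of that “all series against one target” shape — this mimics the behavior described, not Thunder’s actual implementation, and the data is made up:

```python
import numpy as np

rng = np.random.default_rng(1)
series = rng.standard_normal((5, 50))  # 5 time series, 50 time points each
target = rng.standard_normal(50)       # a single target signal

# correlate every series with the one target: one coefficient per series
r = np.array([np.corrcoef(s, target)[0, 1] for s in series])
print(r.shape)  # (5,)
```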
more trying to understand the API, where its limits are, where Spark's limits are, and wrap my head around when it makes sense to do things in Spark vs numpy

gotcha, those are great questions

and we could do better to clarify

we’re in the middle of a new project that will make all these objects (`Images`, `Series`, etc.) look and behave a lot more like numpy arrays (or numpy arrays with extra methods)

and at that point, at least for this stuff, the `spark` vs `numpy` question will just be a matter of speed
after spending a lot of time w/ pandas, I've gotten used to being able to throw a pd.Series into anything that takes numpy arrays. my first intuition was to be able to do the same w/ a Thunder Series & spark functions

but tbh, I haven't pushed on Thunder enough to offer any more insight than that

ah interesting, you mean, you felt you should be able to throw a Thunder object into any Spark function?

yeah, exactly.

so right now that’s generally true if you call `Series.values()`

(not saying that's the way it should be, just that that was my first intuition)

ok

yeah, definitely great feedback

intuition is valuable =)

one other thing… if you want to play around with the neurofinder data and want to ignore Spark / Thunder entirely, you can call `tslocal = ts.collectValuesAsArray()`

that just gives you a local numpy array

good to know. but then I won't learn Thunder :)

ha, gp, wouldn't want that =)