These are chat archives for thunder-project/thunder

20th
Oct 2015
Jeremy Freeman
@freeman-lab
Oct 20 2015 15:18
@poolio very exciting! look forward to checking it out
@mtv2101 sorry i'm late getting to this, dots in names can definitely cause trouble, should have asked earlier if that was it!
it's incredibly annoying
this trick (from thunder's EC2 environment setup) fixes some of those problems https://github.com/thunder-project/thunder/blob/master/thunder/utils/ec2.py#L232
but can introduce across-region issues
i think with ordinary calling format set, and if you're in the same region, everything is fine
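A rough sketch of the same idea using boto directly (not the exact code at that link; the bucket name is illustrative): forcing the ordinary, path-style calling format sidesteps the SSL hostname problems that dotted bucket names hit with the default subdomain style.
from boto.s3.connection import S3Connection, OrdinaryCallingFormat

# path-style requests avoid SSL certificate mismatches for bucket names containing dots
conn = S3Connection(calling_format=OrdinaryCallingFormat())
bucket = conn.get_bucket('my.dotted.bucket')  # illustrative bucket name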
Jeremy Freeman
@freeman-lab
Oct 20 2015 17:43
@mtv2101 moving the correlation discussion into the main thread, @jwittenbach 's thinking about it too
fyi all we're thinking about how to compute all-to-all pixel correlation matrices
Jeremy Freeman
@freeman-lab
Oct 20 2015 17:52
lots of things work well for computing k x k matrices when n >> k
but we want n x n matrices when n >> k (or at least, n is really large and k is also pretty big)
(where "lots of things" -> reduce-style approaches where you aggregate distributed outer products)
Davis Bennett
@d-v-b
Oct 20 2015 17:55
assuming fish-sized images, num pixels squared = (2048 x 1024 x 40)**2, that's pretty big
Jeremy Freeman
@freeman-lab
Oct 20 2015 17:55
@mtv2101 was being modest, just (512 x 512) ** 2
Davis Bennett
@d-v-b
Oct 20 2015 17:56
that seems more reasonable :smile:
Matt Valley
@mtv2101
Oct 20 2015 17:57
my intuitions are not so developed here ... all-to-all pixel correlations could be done on reasonably large but local groups of pixels in SpatialSeries RDDs. But would the cost of shuffling between groups of pixels be too high?
Jeremy Freeman
@freeman-lab
Oct 20 2015 17:58
yup, was thinking the same, probably even better to do it on blocks
basically, you'd turn the image collection into xy x t blocks
compute local xy x xy submatrices
and stitch them together
of course, that only computes a subset of the full correlation
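In rough pseudocode, that block version might look like this (assuming `blocks` is an RDD of (blockId, xy-by-t array) pairs; names are illustrative):
import numpy as np

# corrcoef over the rows of an xy x t block gives the local xy x xy submatrix;
# entries between pixels in different blocks are never computed
submatrices = blocks.mapValues(np.corrcoef)
diagonal = dict(submatrices.collect())  # stitch: block-diagonal pieces only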
Matt Valley
@mtv2101
Oct 20 2015 17:59
at uint16 the (512 x 512)-pixel all-to-all matrix is ~140 GB, so it could be broken down into 8-ish blocks
for example
but if the timeseries is very long that changes everything
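(The arithmetic behind that estimate:)
n = 512 * 512          # pixels
nbytes = n ** 2 * 2    # one uint16 (2 bytes) per pairwise entry
print(nbytes / 1e9)    # ~137 GB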
Jeremy Freeman
@freeman-lab
Oct 20 2015 18:01
so actually, the best approach might be something called a "map side join"
the full data actually isn't huge, O(GB)
so what we might be able to do is take the raw data, and in parallel compute correlations against another, broadcasted version of the raw data
might sound weird, but that gives you the whole matrix, and the result is itself parallelized, so you could feed it directly into distributed kmeans or something
i've used this effectively where the raw data is ~200MB and the correlation matrix is ~50GB
Jeremy Freeman
@freeman-lab
Oct 20 2015 18:11
all that said, the following does "work" and should behave better memory-wise than the Statistics.corr
from numpy import corrcoef

data = tsc.loadExample('mouse-series')
# pair every record with every other record, then compute each pairwise correlation
data.rdd.values().cartesian(data.rdd.values()).map(lambda (x, y): corrcoef(x, y)[0, 1])
that gives you a distributed collection of all the correlation coefficients
it's just a bit inefficient because it's working on individual records at a time
and you probably wouldn't want to collect that to the driver =)
Jason Wittenbach
@jwittenbach
Oct 20 2015 19:05
I was looking into how to put the results from the cartesian back together, e.g. putting each row of the matrix into a single record in an RDD so that you could use them as features for clustering, dimensionality reduction, etc
The problem is that it requires a big groupByKey operation
trying it on the mouse-series data fails for me on a 20-node cluster, so the map-side join, which avoids the shuffle, seems like the better approach
and the data-replication cost should be the same as the cartesian solution
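For reference, a rough sketch of the row assembly described above, reusing `data` from the earlier snippet (illustrative only; this is the shuffle-heavy path):
from numpy import corrcoef

# index each record, pair everything with everything, key each coefficient by the
# first record's index, then groupByKey to gather one full row per record
indexed = data.rdd.values().zipWithIndex().map(lambda vi: (vi[1], vi[0]))
pairs = indexed.cartesian(indexed)
rows = (pairs
        .map(lambda ab: (ab[0][0], (ab[1][0], corrcoef(ab[0][1], ab[1][1])[0, 1])))
        .groupByKey())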
Jason Wittenbach
@jwittenbach
Oct 20 2015 21:01
@mtv2101 here's a code snippet for the map-side join option; I have it working on a dataset that's 512-by-512-by-5000 (x-y-t)
import numpy as np

data = tsc.loadExample('mouse-series')
data.repartition(100)

# broadcast the full data set (small here) to every worker for the map-side join
dataBC = sc.broadcast(data.collectValuesAsArray())

# each record's dot product against the broadcast matrix gives one row of the raw (un-normalized) correlation matrix
corrs = data.applyValues(lambda v: np.dot(v, dataBC.value.T))
corrs.cache()
corrs.count()  # force evaluation so the cached result is materialized

from thunder import KMeans
model = KMeans(6).fit(corrs)
labels = model.predict(corrs).pack(sorting=True)

import matplotlib.pyplot as plt
plt.imshow(labels)
Jason Wittenbach
@jwittenbach
Oct 20 2015 21:06
Gives the following kind of thing:
clustered.png
This dataset is much smaller than yours might be, but I scaled it up to something 512-by-512-by-5000 and it still worked... took maybe 5-10 mins on a 20-node cluster
I’m computing raw correlations since NumPy can do that with a fast matrix multiplication, but you could swap out that applyValues bit for something that does a correlation coefficient instead if you wanted
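For example, one illustrative way to swap that in (an untested sketch: z-scoring both sides turns the same dot product into Pearson correlation coefficients):
import numpy as np

# z-score the broadcast matrix once and each record on the fly; then the dot
# product divided by the series length is the Pearson correlation coefficient
mat = dataBC.value.astype('float64')
matZ = (mat - mat.mean(axis=1, keepdims=True)) / mat.std(axis=1, keepdims=True)
matZBC = sc.broadcast(matZ)

def corrRow(v):
    vz = (v - v.mean()) / v.std()
    return np.dot(vz, matZBC.value.T) / len(v)

corrs = data.applyValues(corrRow)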
Matt Valley
@mtv2101
Oct 20 2015 22:35
wow this is fantastic! I'll give it a try right now