These are chat archives for thunder-project/thunder

Oct 21 2015
Matt Valley
@mtv2101
Oct 21 2015 20:10
@jwittenbach I think we are running from slightly different versions of thunder since I can't use predict from the KMeans class. This worked well on the mouse-series
from thunder import KMeans, KMeansModel
model = KMeans(6).fit(corrs)
labels = KMeansModel.predict(model, corrs).pack(sorting=True)
Jason Wittenbach
@jwittenbach
Oct 21 2015 20:12
@mtv2101 that’s my bad; there’s a typo in my code. The line computing the labels should use the model that was fit in the previous line:
Matt Valley
@mtv2101
Oct 21 2015 20:12
But on EC2, neither the mouse-series nor my own data will finish the count operation
Jason Wittenbach
@jwittenbach
Oct 21 2015 20:13
from thunder import KMeans
model = KMeans(6).fit(corrs)
labels = model.predict(data)
though your version should work as well
How large is the EC2 instance that you’re using?
Matt Valley
@mtv2101
Oct 21 2015 20:15
sorry, scratch that, only my own data doesn't run. I was curious what type of node you used for your (512x512)**2 matrix?
Jason Wittenbach
@jwittenbach
Oct 21 2015 20:15
Though, for the mouse-series data, that should be doable on a single (though perhaps pretty beefy) node
When I was trying that out, I was using 20 nodes
Each node had 16 cores and 256GB of RAM
Matt Valley
@mtv2101
Oct 21 2015 20:15
I was running 20 workers (m3.2xlarge)
Jason Wittenbach
@jwittenbach
Oct 21 2015 20:16
so 320 cores + 5 TB RAM
Matt Valley
@mtv2101
Oct 21 2015 20:16
oh ok your nodes are very fancy :smile:
each m3.2xlarge node has only 24GB
Jason Wittenbach
@jwittenbach
Oct 21 2015 20:17
yeah, I was using Janelia’s cluster, which definitely treats us well :smile_cat:
even with all of that power, it still took maybe 10 minutes to finish the computation
Matt Valley
@mtv2101
Oct 21 2015 20:22
I'll try to spin up some big r3 instances and see if I can get it to run.
But do you think it's possible to configure the lower-RAM nodes to run this successfully? The workers die left and right, and my data is not even that large (1000, 512, 512).
Jason Wittenbach
@jwittenbach
Oct 21 2015 20:25
The issue is with how we’re solving the problem
The first step — dataBC = sc.broadcast(data.collectValuesAsArray()) — is effectively making copies of the entire dataset and shipping it out to each worker
each worker only has a subset of the series, but to compute a row of the correlation matrix, it needs access to all of the other rows
so you’ve hit on exactly the trade-off that we’re dealing with here
this implementation is very computationally efficient, but memory inefficient
the other end of the spectrum would involve shuffling the series around the workers a number of times until they had all been together on the same worker at least once, to compute their pair-wise correlation
this would be very memory efficient, but very complicated… plus the overhead of sending that much data back-and-forth over the network would become a serious issue
however, it might be possible to find some trade-off
where you only “broadcast” subsets of the entire dataset at a time
making the chunks small enough to fit in RAM on the workers
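A rough sketch of that chunked-broadcast idea, in plain PySpark + NumPy; the names corr_chunk, chunked_correlations, series_rdd, and data_local are made up here for illustration and are not thunder APIs:

import numpy as np

def corr_chunk(rows, chunk, chunk_start):
    # correlate a worker's (index, timeseries) rows against one broadcast
    # chunk of z-scored series; chunk has shape (chunk_size, n_timepoints)
    out = []
    for idx, ts in rows:
        z = (ts - ts.mean()) / ts.std()
        out.append((idx, chunk_start, chunk.dot(z) / z.size))
    return out

def chunked_correlations(sc, series_rdd, data_local, chunk_size=4096):
    # z-score every series once on the driver
    z = (data_local - data_local.mean(axis=1, keepdims=True)) \
        / data_local.std(axis=1, keepdims=True)
    n = z.shape[0]
    for start in range(0, n, chunk_size):
        # broadcast only chunk_size series at a time, sized to fit in worker RAM
        bc = sc.broadcast(z[start:start + chunk_size])
        piece = series_rdd.mapPartitions(
            lambda rows, s=start, b=bc: corr_chunk(rows, b.value, s))
        piece.saveAsPickleFile('corr-rows-%d' % start)  # or collect/reduce here
        bc.unpersist()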
Matt Valley
@mtv2101
Oct 21 2015 20:31
It's still unclear to me why ~20GB of RAM can't get this done. The data is ~1GB, and the output is ~140GB / #workers, since each worker only computes a fraction of the rows. So shouldn't the total size in each worker's memory only be ~8GB?
assuming 20 workers
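For reference, that arithmetic checks out roughly like this, assuming float32 values and counting only one triangle of the symmetric matrix (those assumptions aren't stated above):

n_time, ny, nx = 1000, 512, 512
n_pixels = ny * nx                                      # 262,144 series
bytes_per_value = 4                                     # assuming float32

data_gb = n_time * n_pixels * bytes_per_value / 1e9     # ~1.05 GB
full_matrix_gb = n_pixels ** 2 * bytes_per_value / 1e9  # ~275 GB
half_matrix_gb = full_matrix_gb / 2                     # ~137 GB (one triangle)

workers = 20
per_worker_gb = half_matrix_gb / workers + data_gb      # ~7.9 GB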
Jason Wittenbach
@jwittenbach
Oct 21 2015 20:32
hmmm, good point, that sounds right to me
Jeremy Freeman
@freeman-lab
Oct 21 2015 20:32
unfortunately memory usage in spark can be unexpected and hard to debug
Jason Wittenbach
@jwittenbach
Oct 21 2015 20:32
though I have no clue how Spark actually handles all of this
Jeremy Freeman
@freeman-lab
Oct 21 2015 20:32
often the resulting objects in memory are larger than raw calculations would predict
@mtv2101 it can be worthwhile to decrease the size of the data, test a run, make sure it works, then increase to see exactly where it breaks
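One way to run that kind of scaling test before broadcasting anything is to spatially subsample the local array with NumPy slicing; run_correlations below is just a stand-in for whatever computation is being tested, not a thunder function:

# data has shape (1000, 512, 512); step 8 -> 64x64 pixels, step 1 -> full size
for step in (8, 4, 2, 1):
    small = data[:, ::step, ::step]
    n_pixels = small.shape[1] * small.shape[2]
    print('testing %d pixels, full matrix ~%.1f GB'
          % (n_pixels, n_pixels ** 2 * 4 / 1e9))
    run_correlations(small)   # stand-in for the broadcast-based computation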
Matt Valley
@mtv2101
Oct 21 2015 20:34
hrm this is all very tricky
I've spent a decent amount of time looking through the error logs on the dead workers, but it's rarely evident what causes the death
Jeremy Freeman
@freeman-lab
Oct 21 2015 21:24
my guess is it's out of memory, even if you don't see the error
unfortunately this solution is a memory hog; with either nodes that have more memory or some time spent on the memory configuration, it should definitely work
but it might be worth exploring the block-based route
computing pixel-to-pixel correlations within spatially local blocks would probably behave better memory-wise, but it wouldn't give the full correlation matrix, so maybe the question is how important it is to have the full thing
it would be more like a block-diagonal-ish version of the matrix (if the matrix was sorted spatially)
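A minimal NumPy sketch of that block-diagonal idea; the tile size and the plain-NumPy framing are assumptions, and it ignores pixel pairs that straddle tile boundaries:

import numpy as np

def blockwise_correlations(data, block=64):
    # data: (n_time, ny, nx); returns a dict mapping tile coordinates to the
    # correlation matrix of the pixels inside that tile
    n_time, ny, nx = data.shape
    out = {}
    for by in range(0, ny, block):
        for bx in range(0, nx, block):
            tile = data[:, by:by + block, bx:bx + block]
            flat = tile.reshape(n_time, -1).T        # pixels x time
            out[(by // block, bx // block)] = np.corrcoef(flat)
    return out

# each 64x64 tile gives a 4096 x 4096 matrix (~134 MB in float64), so the
# pieces stay manageable even though the full 262144 x 262144 matrix would not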
Matt Valley
@mtv2101
Oct 21 2015 22:25
Does the default persistence type used by thunder allow data storage on disk? How would I change this?
Matt Valley
@mtv2101
Oct 21 2015 23:13
I'll answer my own question. Looks like a simple persist(MEMORY_AND_DISK) will do it in place of cache()
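In PySpark terms that would look something like the snippet below; rdd here stands in for the underlying Spark RDD (e.g. the .rdd attribute of a thunder object):

from pyspark import StorageLevel

# instead of rdd.cache() (which is MEMORY_ONLY), spill partitions that
# don't fit in memory to local disk rather than recomputing them
rdd.persist(StorageLevel.MEMORY_AND_DISK)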