These are chat archives for thunder-project/thunder

21st
May 2016
Boaz Mohar
@boazmohar
May 21 2016 01:03
@kkcthans Hi, first with_keys has a syntax of:
def func(kv):
    key, value = kv
    # do something
    return value

data = data.map(lambda x: func(x), with_keys=True)
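A minimal sketch of the calling convention above, using a plain-Python stand-in rather than Thunder itself. The helper `map_with_keys`, the function `add_frame_index`, and the sample records are all hypothetical; the only point is that the function receives one `(key, value)` tuple and unpacks it in its body.

```python
# Local stand-in for a keyed map; this is NOT Thunder's implementation.
def map_with_keys(records, func):
    """Apply func to each (key, value) pair and keep the original key."""
    return [(key, func((key, value))) for key, value in records]


def add_frame_index(kv):
    # Python 3 does not allow tuple parameters, so unpack inside the body.
    key, value = kv
    return [pixel + key[0] for pixel in value]


# Hypothetical records: key is a one-element index tuple, value is a flat image.
records = [((0,), [10, 20]), ((1,), [10, 20]), ((2,), [10, 20])]
result = map_with_keys(records, add_frame_index)
print(result)
```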
Second, I think the error you got has to do with the fact that you tried to act on an RDD within the lambda.
If you need to act on two distributed arrays (Images objects or RDDs) you would need to convert them to RDDs with .tordd() and use Spark methods like cogroup
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD
Boaz Mohar
@boazmohar
May 21 2016 01:19

Regarding the max question: if your data is an Images object, converting to series or mapping as series won't be faster (I think). I know there is an algorithmic difference between the first two options, but there are also other factors that might dominate the optimization. The main one is the number of partitions (npartitions). images.max() performs a reduce within all the records of a partition, and then your np.max(...) will work across spatial dimensions. On the other hand, .map(...) will first work on each time point, and then you would need to collect each time point and your np.max(...) will work across the time domain. For both you need to add .toarray() before the np.max().

tl;dr time it with different numbers of partitions.

Kyle
@kr-hansen
May 21 2016 15:54
@boazmohar Thanks for your explanation of with_keys. I'll play around a little more with looking at doing .tordd() and using spark methods. I'm anticipating I can do a tordd().cogroup().reduceByKey(np.subtract()) or something like that to do image by image subtraction within a volume. If you're aware of any other options, I'd be happy to know about them.
Also, is anyone aware of why .save() was removed from the RegistrationModel for Thunder 1.0.0? I've tried just saving it using json, but it says RegistrationModel is not serializable.
Jason Wittenbach
@jwittenbach
May 21 2016 16:24
@kkcthans Adding element-wise addition in Bolt (and by extension Thunder) is something that we’ve talked about for a while, we just haven’t gotten around to implementing it. I can imagine doing it like you mention (cogroup + reduceByKey) or doing it as join + map. It would be interesting to see which of those is faster.
Then all we would need to do is to overload the + operator on the BoltArraySpark to make that work
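The two strategies mentioned here can be compared on toy data with plain-Python stand-ins. This is only a sketch of the data flow, not Spark or Bolt code: `via_join`, `via_cogroup`, and the sample collections are hypothetical, and it says nothing about which approach is faster on a cluster. In this stand-in, once values are grouped per key, a plain map over the groups finishes the subtraction.

```python
from collections import defaultdict

# Plain-Python stand-ins for the two strategies; no Spark involved.
# Hypothetical data: key -> flat image, one collection per volume.
left = [(0, [5, 7]), (1, [9, 4])]
right = [(0, [1, 2]), (1, [3, 4])]


def subtract(a, b):
    """Element-wise difference of two equally sized flat images."""
    return [x - y for x, y in zip(a, b)]


def via_join(left, right):
    """join + map: pair values that share a key, then map over the pairs."""
    lookup = dict(right)
    joined = [(k, (v, lookup[k])) for k, v in left if k in lookup]
    return {k: subtract(a, b) for k, (a, b) in joined}


def via_cogroup(left, right):
    """cogroup + map: collect each side's values per key, then combine."""
    groups = defaultdict(lambda: ([], []))
    for k, v in left:
        groups[k][0].append(v)
    for k, v in right:
        groups[k][1].append(v)
    return {k: subtract(ls[0], rs[0]) for k, (ls, rs) in groups.items() if ls and rs}


joined_result = via_join(left, right)
cogrouped_result = via_cogroup(left, right)
print(joined_result)
print(cogrouped_result)
```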
Kyle
@kr-hansen
May 21 2016 16:25
I'll play around with it and post any results I find in terms of speed
Jason Wittenbach
@jwittenbach
May 21 2016 16:26
Cool, could definitely be an interesting PR for Bolt :)
As for finding the max image, not all of those code snippets you wrote will do the same thing
some of them will compute the pixel-wise max across all images first, while others will compute the max for each image first
hard to say which will be faster
both involve some kind of aggregation
the former will do it with a reduce
while the latter will do it with a shuffle and then a map
my guess is that the reduce should be faster
definitely don’t go the map_as_series route
that ends up doing 2 shuffles, which will be extremely slow / intensive, and should only be used if there’s no other way to get what you want :)
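The difference between the two orders of aggregation can be seen on a small example. This is a sketch in plain Python with a hypothetical stack of three 2x2 images; it illustrates only that the intermediate results differ while the overall maximum is the same, not the cost of a reduce versus a shuffle.

```python
from functools import reduce

# Hypothetical stack: three 2x2 images stored as nested lists.
images = [
    [[1, 8], [3, 4]],
    [[5, 2], [9, 0]],
    [[7, 6], [1, 2]],
]


def pixelwise_max(a, b):
    """Combine two images by taking the larger value at each pixel."""
    return [[max(x, y) for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Route 1: reduce across images to one max image, then take its max.
max_image = reduce(pixelwise_max, images)
global_via_reduce = max(max(row) for row in max_image)

# Route 2: compute the max of each image first, then the max of those.
per_image_max = [max(max(row) for row in image) for image in images]
global_via_map = max(per_image_max)

print(max_image, per_image_max, global_via_reduce, global_via_map)
```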
Kyle
@kr-hansen
May 21 2016 16:30
Cool. I've been playing with that some as well.
I'll post anything interesting I find.
Also, what is the best current way to save a RegistrationModel in Thunder 1.0.0? Is there a way to save the whole object, to later load and transform data?
Jason Wittenbach
@jwittenbach
May 21 2016 16:31
awesome, look forward to hearing what you find
Ah yeah, so I was talking to @freeman-lab about this yesterday actually
Kyle
@kr-hansen
May 21 2016 16:31
I know you could save the whole object in Thunder 0.6, but I wasn't sure why that was removed
Jason Wittenbach
@jwittenbach
May 21 2016 16:31
he has plans to reimplement that in a more general way that will let us save all sorts of data structures to disk
but it’s not in 1.0.0 right now
you can definitely call np.save("filename.npy", model.toarray()) to save out the raw values
but unfortunately there is currently no easy way to reconstitute the array back into a RegistrationModel
Kyle
@kr-hansen
May 21 2016 16:34
Ok. Do you know why the RegistrationModel is no longer JSON Serializable? I know that is how it was saved previously.
I'm not as familiar with the underlying structure of the model. Would there be a quick way using the json module in Python to convert it to be json serializable as an object/class?
Jason Wittenbach
@jwittenbach
May 21 2016 16:34
Yeah, the plan is to make a mixin that lets us make any model JSON Serializable
Effectively generalizing the code that we previously used for RegistrationModel
Yeah, that’s exactly what the plan is — to take the code that we used to use for serializing the RegistrationModel and turn it into a standalone package exactly like you’re saying, so that we can use it throughout Thunder
not sure if there is an alternative already out there though
There are definitely things to serialize Python built-in types
but we need to be able to serialize our own classes too (e.g. RegistrationModel contains a collection of Transforms objects, etc)
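One way such a mixin could look, as a rough sketch only. Nothing here is Thunder's actual code: `JSONMixin`, `register`, `decode`, `ToyShift`, and `ToyModel` are hypothetical names, and the approach (tag each object with its class name, recurse into nested objects and lists) is just one common pattern for serializing user-defined classes with the standard `json` module.

```python
import json

_REGISTRY = {}  # class name -> class, used when decoding


def register(cls):
    """Class decorator that makes a class known to the decoder."""
    _REGISTRY[cls.__name__] = cls
    return cls


def encode(value):
    """Turn mixin objects, lists, and dicts into JSON-friendly structures."""
    if isinstance(value, JSONMixin):
        return value.to_dict()
    if isinstance(value, list):
        return [encode(v) for v in value]
    if isinstance(value, dict):
        return {k: encode(v) for k, v in value.items()}
    return value


def decode(value):
    """Inverse of encode: rebuild registered objects from tagged dicts."""
    if isinstance(value, dict) and "__class__" in value:
        cls = _REGISTRY[value["__class__"]]
        obj = cls.__new__(cls)  # skip __init__, restore attributes directly
        for name, item in value.items():
            if name != "__class__":
                setattr(obj, name, decode(item))
        return obj
    if isinstance(value, list):
        return [decode(v) for v in value]
    if isinstance(value, dict):
        return {k: decode(v) for k, v in value.items()}
    return value


class JSONMixin:
    def to_dict(self):
        out = {"__class__": type(self).__name__}
        for name, value in vars(self).items():
            out[name] = encode(value)
        return out

    def to_json(self):
        return json.dumps(self.to_dict())


@register
class ToyShift(JSONMixin):
    def __init__(self, delta):
        self.delta = delta  # list of per-axis offsets


@register
class ToyModel(JSONMixin):
    def __init__(self, transforms):
        self.transforms = transforms  # list of ToyShift objects


model = ToyModel([ToyShift([1, -2]), ToyShift([0, 3])])
text = model.to_json()
restored = decode(json.loads(text))
print(text)
```

Note that JSON turns tuples into lists and dictionary keys into strings, so this sketch sticks to lists and string attribute names.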
Kyle
@kr-hansen
May 21 2016 16:37
Ok, sounds good. I was playing around with json, and might try playing with Pickle for the time being to see if I can get it to work quickly
Jason Wittenbach
@jwittenbach
May 21 2016 16:37
cool
yeah, pickle works great for built-ins :)
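As a stopgap along these lines, a model's state can be reduced to built-in types and pickled. A minimal sketch, assuming a hypothetical `model_state` dictionary that stands in for whatever values were extracted from the real model:

```python
import pickle

# Hypothetical stand-in for a fitted model: one (dy, dx) shift per image index.
model_state = {
    "kind": "displacement",
    "shifts": {0: (0, 0), 1: (2, -1), 2: (3, 1)},
}

# Serialize to bytes and back; pickle.dump / pickle.load do the same with files.
blob = pickle.dumps(model_state)
restored = pickle.loads(blob)

print(restored == model_state)
```

Unlike JSON, pickle keeps the integer keys and the tuples intact. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.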
Jeremy Freeman
@freeman-lab
May 21 2016 17:01
I should have the generic serializable thing done in the next day or two
it's one of those weird things where there's no built in or existing package but you'll find random chunks of code to do it all over the web
so we'll just make it a nice module =)
Kyle
@kr-hansen
May 21 2016 17:02
Awesome. I got pickle working for what I need for now, so whenever that's done I can just reload my modules I want from pickle and resave them with the new module