These are chat archives for thunder-project/thunder

21st
Apr 2016
Jason Wittenbach
@jwittenbach
Apr 21 2016 15:28
@freeman-lab just finished adding a fully-featured local version of Blocks
next up is wiring the padding functionality that already exists into Thunder
pondering what we need in terms of functionality to make thunder-extraction work
right now, thinking the right way to go might be to add a map_to_array method on Blocks
it would take a function to apply to each block (padded or not), apply it, and then collect the results into an array with dtype=object (either numpy.ndarray or BoltArraySpark) with dimensions that index the block that each result came from
does that seem reasonable?
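A minimal local sketch of the map_to_array idea described above, not Thunder's actual implementation. It assumes the blocks are held in a plain dict keyed by block-index tuples (a hypothetical stand-in for a Blocks object) and that results are collected into a NumPy array with dtype=object:

```python
import numpy as np

def map_to_array(blocks, func):
    """Apply func to every block and collect the results by block index.

    `blocks` is a hypothetical stand-in for Blocks: a dict that maps
    block-index tuples such as (0, 1) to NumPy arrays.
    """
    ndim = len(next(iter(blocks)))
    # Grid shape is one more than the largest index seen along each axis.
    grid = tuple(max(idx[d] for idx in blocks) + 1 for d in range(ndim))
    out = np.empty(grid, dtype=object)
    for idx, block in blocks.items():
        # Results may have any shape, so each one is stored as an object.
        out[idx] = func(block)
    return out

# Split a 4x4 array into four 2x2 blocks.
data = np.arange(16).reshape(4, 4)
blocks = {(i, j): data[2 * i:2 * i + 2, 2 * j:2 * j + 2]
          for i in range(2) for j in range(2)}

# The function returns a different number of values for each block.
result = map_to_array(blocks, lambda b: b[b % 3 == 0])
```

Here result has shape (2, 2), and result[i, j] holds whatever the function returned for block (i, j), regardless of its size.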
Jason Wittenbach
@jwittenbach
Apr 21 2016 15:34
then we could have a strict rule that the plain Blocks.map isn’t allowed to change the size of the blocks, period
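One way such a rule could be enforced, sketched under the same assumption that blocks live in a dict of NumPy arrays. The name strict_map is hypothetical and is not part of Thunder:

```python
import numpy as np

def strict_map(blocks, func):
    """Apply func to every block, rejecting results that change the block shape.

    `blocks` is a hypothetical stand-in for Blocks: a dict that maps
    block-index tuples to NumPy arrays.
    """
    out = {}
    for idx, block in blocks.items():
        result = np.asarray(func(block))
        if result.shape != block.shape:
            raise ValueError(
                "block %s changed shape from %s to %s"
                % (idx, block.shape, result.shape)
            )
        out[idx] = result
    return out

data = np.arange(16).reshape(4, 4)
blocks = {(i, j): data[2 * i:2 * i + 2, 2 * j:2 * j + 2]
          for i in range(2) for j in range(2)}

# Shape-preserving functions pass through unchanged.
doubled = strict_map(blocks, lambda b: b * 2)

# A reduction changes the block shape, so it is rejected.
try:
    strict_map(blocks, lambda b: b.sum(axis=0))
    rejected = False
except ValueError:
    rejected = True
```

Anything that needs to change block sizes would then go through a separate method such as the proposed map_to_array.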
Steve Varner
@stevevarner
Apr 21 2016 16:22
I'm looking at the code in images.py for first() and I see that if you're in Spark mode, it calls the function on a Bolt array called "tordd". Am I correct in assuming that this is not actually creating an RDD, but rather just getting the RDD that already exists in the Bolt array? Would a name like "getrdd" be better than "tordd"? Thanks.
Jason Wittenbach
@jwittenbach
Apr 21 2016 17:22
@stevevarner that’s exactly right
As for the name: I think of tordd as casting an Images object to an RDD. Of course, all this entails under the hood is extracting the RDD that we have already created, but the user doesn’t need to know that.
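A toy illustration of the point above: the call reads like a cast, but nothing new is built. The classes FakeBoltArray and FakeImages are hypothetical stand-ins, and a plain list stands in for the RDD, so this only mirrors the delegation pattern described in the chat rather than the real Thunder or Bolt code:

```python
class FakeBoltArray:
    """Hypothetical stand-in for a Spark-backed Bolt array."""

    def __init__(self, rdd):
        self._rdd = rdd  # the already-created RDD (a plain list here)

    def tordd(self):
        # No computation or copying: hand back the stored object.
        return self._rdd


class FakeImages:
    """Hypothetical stand-in for an Images object wrapping a Bolt array."""

    def __init__(self, values):
        self.values = values

    def tordd(self):
        # Reads like a cast to the caller, but only delegates.
        return self.values.tordd()


records = [((0,), "image-0"), ((1,), "image-1")]
images = FakeImages(FakeBoltArray(records))

# The object returned is the very one that was stored.
same_object = images.tordd() is records
```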
Steve Varner
@stevevarner
Apr 21 2016 18:49
Thanks. That makes sense now that I understand how it works.
Have you given any thought to being able to write your BoltArray RDD to HDFS to persist that as an intermediate step so that we don't have to reload the entire image series every time we want to do something with the data?
Also maybe put a metadata string as the 2nd half of the tuple in the key of the Bolt Array so that we can call the filter function on the RDD of the Bolt Array to filter by something like date ranges that would be within the metadata?
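A rough sketch of the metadata-in-the-key idea, using a plain Python list of key-value records in place of an RDD. The record layout (index, ISO date string) and the helper in_range are assumptions made for illustration. With Spark, a predicate like this could be passed to RDD.filter in the same way:

```python
from datetime import date

# Each record is ((index, metadata), value); the metadata here is an
# ISO date string, which is only one possible choice.
records = [
    ((0, "2016-04-19"), "image-0"),
    ((1, "2016-04-20"), "image-1"),
    ((2, "2016-04-21"), "image-2"),
    ((3, "2016-04-22"), "image-3"),
]

def in_range(record, start, end):
    """Return True if the date in the record's key lies within [start, end]."""
    (_, stamp), _ = record
    return start <= date.fromisoformat(stamp) <= end

# Keep only the records acquired on 20 or 21 April 2016.
selected = [r for r in records
            if in_range(r, date(2016, 4, 20), date(2016, 4, 21))]
selected_indices = [key[0] for key, _ in selected]
```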