These are chat archives for thunder-project/thunder

16th
Dec 2016
Davis Bennett
@d-v-b
Dec 16 2016 16:45
@jwittenbach is there a reason why thunder.series and thunder.images objects don't have filter methods?
Jason Wittenbach
@jwittenbach
Dec 16 2016 16:46
@d-v-b I think they do!
the method is defined on their parent class, so it’s documented under “base methods”
Davis Bennett
@d-v-b
Dec 16 2016 16:50
ah yes, there is a filter method, the problem I was having is that the filter method can't know about keys
map has a withkeys kwarg, but i don't think filter has the same
Davis Bennett
@d-v-b
Dec 16 2016 18:03
@jwittenbach performance question: suppose I want to run a series operation on some images, and I'm only interested in a subset of the image volume. If i have a binary mask image that determines which pixels I want to process vs. those I don't, is it more efficient to do something like my_images.map(lambda v: v[mask]).toseries() or to do my_images.toseries().tordd().filter(lambda kv: kv[0] in list_of_coordinates_from_mask)
assume both mask and list_of_coordinates_from_mask are broadcast variables
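A toy sketch of the two orderings in the question, in plain Python with no thunder or Spark involved. The nested lists stand in for a small image stack, and all names here are made up for illustration. It only shows that both routes select the same per-pixel series; it says nothing about which is faster on a cluster.

```python
# Toy stand-in for an image stack: 3 time points, each a 2x3 image.
images = [
    [[1, 2, 3], [4, 5, 6]],
    [[10, 20, 30], [40, 50, 60]],
    [[100, 200, 300], [400, 500, 600]],
]
mask = [[True, False, True], [False, False, True]]

# Coordinates selected by the mask, in row-major order.
kept_coords = [
    (y, x)
    for y, row in enumerate(mask)
    for x, keep in enumerate(row)
    if keep
]

# Option A: mask each image first, then regroup into one series per kept pixel.
masked_images = [[img[y][x] for (y, x) in kept_coords] for img in images]
series_a = {
    coord: [masked[i] for masked in masked_images]
    for i, coord in enumerate(kept_coords)
}

# Option B: build a series for every pixel, then filter on the coordinate key.
all_series = {
    (y, x): [img[y][x] for img in images]
    for y in range(len(mask))
    for x in range(len(mask[0]))
}
kept_set = set(kept_coords)
series_b = {key: value for key, value in all_series.items() if key in kept_set}

print(series_a == series_b)
```

In this toy version, option A discards the unwanted pixels before any regrouping happens, while option B builds a series for every pixel and only then drops most of them.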
Jason Wittenbach
@jwittenbach
Dec 16 2016 19:21
@d-v-b in re: filtering, I added the labels functionality to get around exactly this
it effectively lets you associate some metadata to each record (in either an Images or Series object)
once you do a filter, you run the risk of ruining the array-like structure of the data
so any dimensions that you filter over will be linearized
and the records that didn’t make it through the filter will be dropped
the labels will all be reshaped and filtered
so you can use them to figure out which records you’re left with
example:
import numpy as np
import thunder as td
data = td.images.fromrandom((20, 10, 10))
data.labels = np.arange(data.shape[0]) # label images sequentially
filtered = data.filter(lambda x: x.sum() > 0) # keep images whose pixel sum is positive
print(filtered.labels)
when I run that, I get:
array([ 1,  6,  7,  8, 10, 11, 13, 14, 15, 16, 18])
so I can see which records made it through the filter
Davis Bennett
@d-v-b
Dec 16 2016 19:25
but how does that help me if I want to filter on the keys?
Jason Wittenbach
@jwittenbach
Dec 16 2016 19:25
oooh
in a lot of cases, can’t you just do that via indexing?
because the keys are effectively just the indices
Davis Bennett
@d-v-b
Dec 16 2016 19:28
what if I have a series object where each record is from a (z,y,x) coordinate
and I only want records from a subset of the full (z, y, x) space
Jason Wittenbach
@jwittenbach
Dec 16 2016 19:29
OK, well, this won’t be super memory efficient
but I think it will work
say that f is a function that takes (x, y, z) and tells you if it’s in the region you want to keep or not
then you could apply f to a list of all possible coordinates
and index with that boolean array
though I can see how that’s not the most elegant solution
at that point, I might just drop into RDD-land
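The "apply f to every coordinate, then index with the boolean array" idea can be sketched in plain Python. This is an illustration only: the shape, the region test `f`, and the stand-in records are hypothetical, and coordinates are ordered (z, y, x) to match the question above.

```python
from itertools import compress, product

shape = (2, 3, 4)  # hypothetical (z, y, x) extent of the volume


def f(coord):
    """Hypothetical region test: keep plane z == 1 where x >= 2."""
    z, y, x = coord
    return z == 1 and x >= 2


# Every possible coordinate in row-major order. This full list is the
# memory cost mentioned above: one entry per record.
coords = list(product(*(range(n) for n in shape)))

# Boolean selector, one flag per coordinate.
keep = [f(c) for c in coords]

# Stand-in records, one per coordinate, in the same order as coords.
records = [z * 100 + y * 10 + x for (z, y, x) in coords]

# Equivalent of indexing the records with the boolean array.
selected = list(compress(records, keep))
print(len(selected))
```

The selector has as many entries as there are records, so for a large volume it is worth checking that it fits comfortably in memory before going this route.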
Davis Bennett
@d-v-b
Dec 16 2016 19:32
yeah exactly
so shouldn't series.filter() have a withkeys kwarg?
Jason Wittenbach
@jwittenbach
Dec 16 2016 19:33
let me think about how I would do withkeys for filter
yeah, that might be the trick :)
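A minimal sketch of what a key-aware filter could look like, written over a plain list of (key, value) pairs rather than thunder objects. `filter_records` and its `with_keys` flag are hypothetical names for illustration, not thunder's API, and passing the predicate a (key, value) tuple is an assumption about how such a flag might behave.

```python
def filter_records(records, func, with_keys=False):
    """Filter (key, value) pairs.

    With with_keys=False the predicate sees only the value.
    With with_keys=True it sees the whole (key, value) tuple.
    """
    if with_keys:
        return [(k, v) for k, v in records if func((k, v))]
    return [(k, v) for k, v in records if func(v)]


# Hypothetical records keyed by (row, column).
records = [((0, 0), -1.0), ((0, 1), 2.0), ((1, 0), 3.0), ((1, 1), -4.0)]

# Value-only filter: keep positive values.
positive = filter_records(records, lambda v: v > 0)

# Key-aware filter: keep everything in row 1, whatever the value.
row_one = filter_records(records, lambda kv: kv[0][0] == 1, with_keys=True)

print(positive)
print(row_one)
```

Keeping the keys in the output either way means a later step can still tell which coordinates survived.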
Davis Bennett
@d-v-b
Dec 16 2016 23:16
@jwittenbach this is what i'm doing now, it's ugly:
ser_masked = td.series.fromrdd(ser.tordd().filter(lambda v: v[0] in ref_mask_keys_bc.value))
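A small aside on the membership test in that filter, sketched in plain Python. The chat doesn't say what type the broadcast value holds. If it is a list, each `in` check scans the list, whereas a set or frozenset of coordinate tuples gives hash-based lookups. The names and data below are hypothetical.

```python
# Hypothetical (key, value) records keyed by (z, y, x).
records = [
    ((z, y, x), z * 100 + y * 10 + x)
    for z in range(2)
    for y in range(2)
    for x in range(2)
]

# Keys selected by the mask, stored as a frozenset so that
# `key in ref_mask_keys` is a hash lookup rather than a list scan.
ref_mask_keys = frozenset([(0, 0, 1), (1, 1, 0)])

# Same predicate shape as the RDD filter above: test the key, keep the pair.
kept = [kv for kv in records if kv[0] in ref_mask_keys]
print(kept)
```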
Davis Bennett
@d-v-b
Dec 16 2016 23:26
if it's not too much trouble I can whip up a PR that implements withkeys on filter