These are chat archives for thunder-project/thunder

16th
Feb 2015
Davis Bennett
@d-v-b
Feb 16 2015 16:20
that's what I was thinking -- you need some way of tracking the dimensions of the data other than the keys themselves. but it would speed up every voxel-wise calculation by at least 1/3
assuming zebrafish brain data with the background masked
Davis Bennett
@d-v-b
Feb 16 2015 16:43
@freeman-lab I notice that lots of Series methods which assume the values have numeric type (e.g. Series.max()) break for the output of RegressionModel, where the values have type 'Object'
Jason Wittenbach
@jwittenbach
Feb 16 2015 18:32
@d-v-b I actually started working on a DataTable class that would let us handle things like that, and it was motivated by exactly the issue you bring up: the Regression output and the desire to keep Series records restricted to storing an ndarray with numeric type elements. However, it came to light that Spark is going to replace its SchemaRDD object with something very similar to what I was working on, so we shelved the DataTable work to see what the Spark folks come out with that we might leverage.
Davis Bennett
@d-v-b
Feb 16 2015 18:35
sounds good :hamburger:
Davis Bennett
@d-v-b
Feb 16 2015 21:19
so @jwittenbach what was your workaround for dealing with the output of RegressionModel? For instance, I need to change the datatype of the output arrays to float16, which is impossible using the Series.astype method, so I use Series.applyValues to recast the data, but this somehow breaks Series.select, so I also have to use Series.applyValues to get my data out and packed...
Jason Wittenbach
@jwittenbach
Feb 16 2015 21:27
I usually just resort to using the underlying RDD and map operations -- e.g. if I wanted to collect all of the things in the second slot of the records:
data.rdd.map(lambda kv: kv[1][1]).collect()  # kv is a (key, value) record; kv[1][1] is the second slot of the value
Davis Bennett
@d-v-b
Feb 16 2015 21:28
yeah that seems like a good fallback strategy
Jason Wittenbach
@jwittenbach
Feb 16 2015 21:28
That's interesting that using Series.applyValues breaks Series.select though
I can't think of any reason why it should do that....
Jason Wittenbach
@jwittenbach
Feb 16 2015 21:34
Because none of the Data.apply type of functions touch the index, which is what Series.select uses to grab the right values
Davis Bennett
@d-v-b
Feb 16 2015 21:49
here's the code that worked for me:
newType = 'float16'

result = regressmodel.fit(imDat)
result = result.applyValues(lambda x: [x[0].astype(newType), x[1].astype(newType), x[2].astype(newType)])
result.cache()

betas = result.applyValues(lambda x: x[0])
stats = result.applyValues(lambda x: x[1])
resid = result.applyValues(lambda x: x[2])
the betas = result.applyValues(lambda x: x[0]) stuff was because result.select('betas') didn't work
and result.select('betas') broke after I changed the types using applyValues
Jason Wittenbach
@jwittenbach
Feb 16 2015 21:50
Ah, ok, I think I see why select doesn't work here
Davis Bennett
@d-v-b
Feb 16 2015 21:51
am I fiddling with the dimensions / shape of result.values?
Jason Wittenbach
@jwittenbach
Feb 16 2015 21:51
On your first applyValues, you're returning a list; Thunder wants that to be an ndarray
Davis Bennett
@d-v-b
Feb 16 2015 21:51
aha
Jason Wittenbach
@jwittenbach
Feb 16 2015 21:52
so just throw a numpy.array around the return value and it should work :smiley_cat:
Davis Bennett
@d-v-b
Feb 16 2015 21:52
sweet, thanks! :hatching_chick:
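
For reference, the list-versus-ndarray difference discussed above can be sketched with NumPy alone. This is a minimal illustration, not Thunder's own code: `record` is a hypothetical stand-in for one record's value (three equal-length arrays playing the role of betas, stats, and resid), and `recast_as_list` / `recast_as_array` are made-up helper names for the kind of function one might pass to applyValues.

```python
import numpy as np

# Hypothetical stand-in for one record's value: three equal-length arrays
# (assumed here for simplicity; real regression output may differ in shape).
record = [np.arange(4, dtype="float64"), np.ones(4), np.zeros(4)]


def recast_as_list(value, new_type="float16"):
    # Mirrors the lambda from the chat: recasts each part but returns a plain list.
    return [part.astype(new_type) for part in value]


def recast_as_array(value, new_type="float16"):
    # Same recast, wrapped in numpy.array so the result is a single ndarray.
    return np.array([part.astype(new_type) for part in value])


as_list = recast_as_list(record)
as_array = recast_as_array(record)

print(type(as_list).__name__)
print(type(as_array).__name__, as_array.dtype, as_array.shape)
```

In this toy case the second helper yields one float16 ndarray rather than a list of arrays. If the three parts had different lengths, recent NumPy versions would likely need dtype=object in the np.array call, which brings back object-typed values.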