These are chat archives for thunder-project/thunder

26th
Apr 2016
Thomas Richner
@tjr1
Apr 26 2016 03:50
@d-v-b @jwittenbach Thanks very much. Super excited to kick the tires of Spark for some calcium imaging data.
AJ Keller
@aj-ptw
Apr 26 2016 10:46
Hey everyone! I'm new here, but I plan to use Thunder as the signal processing part of my EEG pipeline. I have writes going to Cassandra, then I plan to take that raw data, convert to floats, do my feature extraction with Thunder, then use scikit-learn to either test against a model or train a new one! If anyone knows of any open source projects using a pipeline close to that, I would appreciate some links!
I already have the spark/Cassandra cluster up and running and the pipeline built to get data to Cassandra.
Henry
@hluetck
Apr 26 2016 13:34
Hey everyone! I have a simple question: I converted a Spark RDD into a Thunder Series object like this:
series = td.series.fromrdd(rdd)
I can get the first record from series as an array using series.first(). But how can I get the nth record? In the previous thunder version (0.6), it was possible to do series[n], but this no longer works. Any help would be greatly appreciated.
Davis Bennett
@d-v-b
Apr 26 2016 14:35
@hluetck you can now do series[n] to get the nth record
I'm a little confused, because I don't think this was possible in thunder 0.6
since indexing into a series object using square brackets was only added with thunder 1.0, iirc
what version of thunder are you using?
Henry
@hluetck
Apr 26 2016 14:42
I am using thunder-python (1.0.0).
series[n] returns another Series object. However, I'd like to get the array, as returned by series.first(). Does this make sense?
Jeremy Freeman
@freeman-lab
Apr 26 2016 14:45
@hluetck ah! can you try doing series[n] and then call .toarray()?
Davis Bennett
@d-v-b
Apr 26 2016 14:45
yeah I see what you mean, you can do series[n].values in local mode, or series[n].toarray()
Jeremy Freeman
@freeman-lab
Apr 26 2016 14:46
in general now for consistency all operations return the original object type, but you can always get an array with .toarray()
Henry
@hluetck
Apr 26 2016 14:54
I actually tried this, but get an error: http://pastebin.com/NH5KQp53
Jason Wittenbach
@jwittenbach
Apr 26 2016 15:49
@hluetck the RDD that you pass to fromrdd needs each record to be a key-value pair where the key is a tuple representing the index. From the error you posted, my guess is that your keys are currently ints and that’s what’s causing the error. Try wrapping them in a single-element tuple:
data = fromrdd(rdd.map(lambda kv: ((kv[0],), kv[1])))
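A minimal sketch of the key-wrapping step above, using a plain Python list in place of the RDD so it runs without Spark. The sample records and the helper name `wrap_key` are assumptions for illustration; on a real RDD the same function would be passed to `rdd.map`.

```python
# Stand-in for the RDD contents: (int key, value) pairs, as in the question.
records = [(0, [1.0, 2.0, 3.0]), (1, [4.0, 5.0, 6.0])]

def wrap_key(kv):
    """Turn (key, value) into ((key,), value) so the key is a tuple index."""
    return ((kv[0],), kv[1])

# Equivalent to rdd.map(wrap_key) on a real RDD.
wrapped = [wrap_key(kv) for kv in records]

keys = [key for key, _ in wrapped]
print(keys)
```

Values are left untouched; only the key changes from an int to a single-element tuple.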
Jeremy Freeman
@freeman-lab
Apr 26 2016 16:22
@hluetck @jwittenbach if that ends up being it we might want to do some optional format checking in fromrdd
basically, call first and validate the format
unless it's one of our own loading methods and we know it'll be correct
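The format check suggested here could look something like the sketch below. `validate_record` is a hypothetical helper, not part of Thunder; it only checks that a record is a (key, value) pair whose key is a tuple of integers, which is the requirement described earlier in the thread.

```python
import numbers

def validate_record(record):
    """Raise ValueError unless record looks like ((i, j, ...), value).

    Hypothetical helper for illustration; not a Thunder function.
    """
    if not (isinstance(record, tuple) and len(record) == 2):
        raise ValueError("record must be a (key, value) pair")
    key, _ = record
    if not isinstance(key, tuple):
        raise ValueError("key must be a tuple, got %s" % type(key).__name__)
    if not all(isinstance(k, numbers.Integral) for k in key):
        raise ValueError("key elements must be integers")
    return record

# A correctly formatted record passes through unchanged.
ok = validate_record(((0,), [1.0, 2.0]))

# A record with a bare int key is rejected.
try:
    validate_record((0, [1.0, 2.0]))
    rejected = False
except ValueError:
    rejected = True
```

On an RDD, the record to check would presumably come from `rdd.first()`, so only one record is pulled back to the driver.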
Kyle
@kr-hansen
Apr 26 2016 17:23
This seems somewhat related to issue #287, though that one is about images rather than series. Is there a rationale for requiring a call to .toarray() on image objects even after running .max() or .mean()? I had also thought that running .squeeze() would give the array output I anticipated, but .toarray() uses a different .squeeze() than images.squeeze() and they behave differently.
Jeremy Freeman
@freeman-lab
Apr 26 2016 22:11
@kkcthans great questions, so both behaviors come from the following two rationales (1) all operations on series and images should return an object of the same type and (2) both data types contain a non-zero number of the underlying objects, and the first dimension indexes that number
Jeremy Freeman
@freeman-lab
Apr 26 2016 22:19
point (2) in particular is why calling e.g. .mean() on a (100, 10, 20) images object gives you (1, 10, 20) instead of (10, 20)
and (1) is why we just always use toarray() to get an array version out
regardless of whether it's local or distributed
when generating a reference image as you're discussing in #287 is it not possible to just call .mean().toarray()?
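The shape behavior in point (2) can be illustrated with plain NumPy. This is only an analogy using `keepdims=True`, not Thunder's implementation; the array `stack` is an assumed stand-in for 100 images of shape (10, 20).

```python
import numpy as np

# Stand-in for an images object holding 100 images of shape (10, 20).
stack = np.zeros((100, 10, 20))

# Averaging over the first axis while keeping it mirrors the
# (1, 10, 20) result described above: still "one image" in a container.
mean_kept = stack.mean(axis=0, keepdims=True)
print(mean_kept.shape)  # (1, 10, 20)

# Dropping the leading axis gives the plain (10, 20) reference image.
reference = mean_kept[0]
print(reference.shape)  # (10, 20)
```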