These are chat archives for thunder-project/thunder

10th
Nov 2015
Kyle
@kr-hansen
Nov 10 2015 01:30

So I've tried running the setup.py from the source distribution of the latest version of Thunder (0.5.1) while I had Spark 1.5.0 and Anaconda 2.2.0 active. Just as before, it seems I can open Thunder correctly with spark and anaconda active in my files. Is there anywhere in the setup scripts that I need to define what versions of Spark/Hadoop/Python to look for before running setup.py build and setup.py install?

It seemed to set up correctly, with the files being placed in the appropriate folder. However, I can't load the fish-series example. The IT guys on the cluster suggested it is a Hadoop dependency issue, but building from source with Spark 1.5.0 and Anaconda 2.2.0 active didn't seem to solve the problem.

Any suggestions on how to get Thunder running on a cluster with Hadoop 2.6 installed?

Aaron Kerlin
@aaronkerlin
Nov 10 2015 16:05
The sortByKey() method seems to throw an error anytime I use it on an Image object right after using filterOnKeys(...). The error traces back to: "File "build/bdist.linux-x86_64/egg/thunder/rdds/data.py", line 447, in <lambda>
TypeError: 'int' object has no attribute '__getitem__'". sortByKey() works just fine if I use it directly on the underlying RDD. Any ideas? Thanks
Jason Wittenbach
@jwittenbach
Nov 10 2015 21:19
hey @aaronkerlin, I think you’ve found a bit of a bug…or at least an inconsistency
theoretically, all keys in Thunder should be tuples
however, when loading an Images object, the key represents the frame, which is just an integer
it looks like sortByKey() is one function that actually relies on the keys being tuples
tarun joshi
@26tarun
Nov 10 2015 21:21
Hi, we are doing a project on real-time HD image analytics using Thunder and Spark. I believe that when I ran the example PCA algorithm on our facial recognition test data ("https://github.com/26tarun/iPythonNotebooks/blob/master/PCA.ipynb"), the results were not analogous to what is shown at "http://thunder-project.org/thunder/docs/tutorials/factorization.html#pca". We converted our images to .bin files using only the Thunder APIs. Please help or guide us: what mistake are we making?
Jason Wittenbach
@jwittenbach
Nov 10 2015 21:22
so a quick work-around is just to wrap the keys in single-element tuples before calling sortByKey()
that would look something like:
data = tsc.loadImages(…)
result = data.applyKeys(lambda k: (k,)).filterOnKeys(…).sortByKey()
for me, calling sortByKey() gives the error whether or not I call filterOnKeys() first, so hopefully this will solve your problem from now until we can fix this
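
To illustrate why bare integer keys trip up a sort that indexes into each key, here is a minimal plain-Python sketch. It does not use Thunder or Spark; `records`, `first_key_element`, and `wrap_key` are hypothetical names, and the only assumption taken from the traceback above is that the sort subscripts the key.

```python
# Plain-Python illustration; this is NOT Thunder's implementation.
# Records mimic (key, value) pairs where the key is a bare frame index.
records = [(2, "frame-c"), (0, "frame-a"), (1, "frame-b")]


def first_key_element(record):
    # Assumed behaviour: the sort indexes into the key, so the key must be a tuple.
    key, _ = record
    return key[0]


def wrap_key(record):
    # Same idea as applyKeys(lambda k: (k,)): turn k into the one-element tuple (k,).
    key, value = record
    return ((key,), value)


try:
    sorted(records, key=first_key_element)
    failed = False
except TypeError:
    # Integer keys cannot be subscripted, which mirrors the error reported above.
    failed = True

# After wrapping, the same key function works and the records sort by frame index.
wrapped = [wrap_key(r) for r in records]
result = sorted(wrapped, key=first_key_element)
```

On Python 3 the TypeError message is worded differently from the Python 2 one quoted above, but the cause is the same.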
Jason Wittenbach
@jwittenbach
Nov 10 2015 21:47
@26tarun It looks like you’re using Thunder correctly. The main difference I see between your analysis and our example is in the results: the weights that we find on our example data set are much larger than those that you find. This probably just has to do with the range of the input signals. Try playing around with the scale parameter in the calls to Colorize to see if you can get some brighter colors; the polar colorization scheme maps the combined magnitudes of the weights to the “value” (i.e. bright to dark), so your small weights are all just coming out as variations on black.
also, in the example analysis, the different images come from different planes in a 3D volume, which is why we use the max-projection (amax) for the final plot. If the individual images in your dataset do not have a similar relationship, then this might not be appropriate for you.
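
As a rough illustration of why small weights come out nearly black under a polar scheme, the sketch below maps a pair of weights to an HSV color using only the standard library. `polar_color` is a hypothetical helper, not Thunder's Colorize, and the clipping rule and `scale` default are assumptions made for the example.

```python
import colorsys
import math


def polar_color(w1, w2, scale=1.0):
    # Hypothetical helper; Thunder's Colorize may differ in detail.
    # Hue comes from the angle of the weight pair, value (brightness)
    # from its magnitude multiplied by `scale` and clipped to 1.
    angle = math.atan2(w2, w1)
    hue = (angle % (2 * math.pi)) / (2 * math.pi)
    value = min(1.0, scale * math.hypot(w1, w2))
    return colorsys.hsv_to_rgb(hue, 1.0, value)


# Small weights give a brightness close to zero, i.e. almost black.
dim = polar_color(0.003, 0.004)

# A larger scale lifts the same weights to mid brightness.
brighter = polar_color(0.003, 0.004, scale=100)
```

With full saturation the brightest RGB channel equals the value, so comparing `max(dim)` with `max(brighter)` shows the effect of the scale directly.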
Jason Wittenbach
@jwittenbach
Nov 10 2015 21:53
@kkcthans Loading the example data on a cluster can be tricky. The problem is that tsc.loadExample assumes that the example files are in a fixed location within the Thunder directory, but the relative path to this folder might be different on different machines in your Spark network (driver, master, worker, etc)
To load data, you need a path that gets you to the data from all of the nodes in your cluster
the example data lives inside the Thunder directory at path_to_thunder/utils/data/mouse/images/ (for example for the mouse imaging data) — or something at least close to that depending on how recent your version of Thunder is
So if your installation of Thunder is on a mounted network drive, so that all of the computers can use the same absolute path,

then you should be able to load it with

data = tsc.loadImages(path)

where path is that absolute path that gets you to the data on all of your nodes
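
A small sketch of building such a path, assuming a hypothetical shared mount point `/shared/thunder` that every node sees identically. `example_images_path` is an illustrative helper, not part of Thunder, and the sub-path is copied from the message above, so it may differ between Thunder versions.

```python
import posixpath


def example_images_path(thunder_root):
    # thunder_root is a hypothetical shared mount visible from every node.
    if not posixpath.isabs(thunder_root):
        raise ValueError("expected an absolute path visible from all nodes")
    # Sub-path taken from the message above; it may vary by Thunder version.
    return posixpath.join(thunder_root, "utils", "data", "mouse", "images")


# "/shared/thunder" is a made-up location used only for illustration.
path = example_images_path("/shared/thunder")
```

The resulting `path` is what would then be handed to tsc.loadImages(path).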

Jason Wittenbach
@jwittenbach
Nov 10 2015 21:59
This could be your problem….or it could be a Hadoop version thing; we would have to ask @freeman-lab about that
Jason Wittenbach
@jwittenbach
Nov 10 2015 22:15
@shansrockin Is the example dataset that you made a 1D array? I think the problem is that your keys are not ending up as tuples: you can see that when you do data.first(), you get (0, array([…])), but for a valid Thunder object, this should be ((0,), array([…]))
as a workaround, try doing
data = tsc.loadSeries(…).applyKeys(lambda k: (k,))
(sorry for the delayed replies, btw!)
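
The check described above (looking at the first record to see whether its key is a tuple) can be sketched in plain Python. `has_tuple_key` is a hypothetical helper, and the two records below stand in for what data.first() might return; lists are used in place of arrays to keep the example dependency-free.

```python
def has_tuple_key(record):
    # A record is a (key, value) pair; a valid key is expected to be a tuple.
    key, _ = record
    return isinstance(key, tuple)


# Stand-ins for the output of data.first() before and after wrapping the keys.
bare_key_record = (0, [1.0, 2.0, 3.0])
tuple_key_record = ((0,), [1.0, 2.0, 3.0])

needs_wrapping = not has_tuple_key(bare_key_record)
already_valid = has_tuple_key(tuple_key_record)
```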