These are chat archives for thunder-project/thunder

23rd
May 2016
Jeremy Freeman
@freeman-lab
May 23 2016 00:11
hi @evaristoc ! happy to field any questions you have here
mobcdi
@mobcdi
May 23 2016 11:58
@freeman-lab Can I ask a few questions as well?
Kyle
@kr-hansen
May 23 2016 13:10
@mobcdi Go ahead and ask any questions. Anyone who may know the answers will happily respond, whether it is @freeman-lab or somebody else. I've gotten a lot of good help from this chatroom
mobcdi
@mobcdi
May 23 2016 14:15
Thanks. 1) Where does thunder fit in with the other projects mentioned in this room, like lightning and bolt, and are there others in the ecosystem that I should know about? 2) What would a thunder set up usually consist of? 3) Does YARN or Mesos help or hinder thunder on Spark?
Kyle
@kr-hansen
May 23 2016 16:14
@mobcdi I can't answer all your questions, but I'm happy to try:
1) I have only personally really worked with Thunder and Bolt. Bolt is meant to provide basic manipulation of distributed arrays. Thunder uses Bolt for its distributed array management. From what I understand, early versions of Thunder included the functionality of what is now Bolt. However, it was spun off as its own project so that Thunder and Bolt could each focus on different problems that occur with distributed arrays/images/time series.
2) By "thunder set up" do you mean the environment you run it on? I think a lot of people use Amazon S3. I personally have it running on my institution's supercomputing cluster. We have a Hadoop cluster that I run my Spark/Thunder jobs on.
3) If you are using Thunder 1.0+, it is intended to be independent of your Spark environment, with the potential in the future for different backends that are not Spark, but I could be wrong on that account. That being said, I run Thunder on a Spark cluster with Yarn in client mode and things work well for me.
Jeremy Freeman
@freeman-lab
May 23 2016 16:17
amazing answers @kkcthans ! you should be writing our blog posts about this stuff =)
re (1), that's exactly the right overview of how thunder and bolt fit together. lightning is a library for web-based visualization in javascript; it's very cool but basically entirely independent, except that the developers all work together and we do sometimes use the projects together
mobcdi
@mobcdi
May 23 2016 16:31
Thanks @kkcthans and @freeman-lab . Does thunder work better with particular spark or hadoop versions?
Jeremy Freeman
@freeman-lab
May 23 2016 16:33
should be entirely independent of hadoop, as there's really no hadoop-specific functionality, and for spark it's tested with the most recent versions (1.5 or 1.6)
Kyle
@kr-hansen
May 23 2016 17:22
Is the example NMF in thunder-extraction using the same NMF code as the paper from Pnevmatikakis et al. (https://github.com/epnev/ca_source_extraction)? It says they provided it to both SIMA and Thunder. However, looking through the code had me guessing that was pre Thunder 1.0. Is their CNMF the NMF that got transferred over, or is it just a generic NMF for testing purposes?
It looks like @j-friedrich forked thunder-extraction at one point, so I was wondering if that is the version that made it in.
Jeremy Freeman
@freeman-lab
May 23 2016 17:31
@kkcthans nope, not yet. the NMF in there now is super generic, just ordinary NMF followed by thresholding and some minimal morphological splitting / clean up
but @j-friedrich is working on adding it! will end up as a separate method, probably CNMF
so yeah exactly the one there is really just for testing
i'm working now on some generic code for merging regions across blocks which can probably be used by several algorithms
Kyle
@kr-hansen
May 23 2016 17:33
Cool. Thanks for clarifying that.
One other question I had: how are you able to use the underlying BoltArray for an Images/Series object?
I was playing around on my forked repo with Bolt on implementing element-wise addition, subtraction, etc...
I got it working with just importing Bolt directly, but I'm not sure how to access those same methods through Thunder now that I have loaded my Images in
mobcdi
@mobcdi
May 23 2016 18:31
Are there size limitations on the use of series? e.g. I have a few mp4 files about 1 hour long at 25 fps. Could I use series to plot some measure/attribute over the entire file?
Kyle
@kr-hansen
May 23 2016 18:48
Also, I just realized that Thunder already has element-wise addition, subtraction, etc. built in via Thunder/base.py (element_wise). Haha, so that addresses what I discussed a few days ago above with @jwittenbach
@mobcdi I don't know for certain as I mostly work with images personally, but I don't see any reason programmatically why Thunder would be the bottleneck/limitation for a series. I work with images that are 1024x1024 pixels recorded at 20 fps for extended periods of time (40,000+ frames in some cases) and Thunder works for what I need. I'd suggest you try it out.
Kyle
@kr-hansen
May 23 2016 18:53
I have run into issues with data this size converting from image to series objects, but I'm pretty certain that is due to how my cluster is structured, not due to Thunder
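For a rough sense of the sizes being discussed, the frame counts and raw array size can be worked out directly. This is only a back-of-the-envelope sketch: the 2 bytes per pixel (16-bit data) is an assumption, not something stated above, and the function names are made up for illustration.

```python
def frame_count(duration_s, fps):
    """Number of frames in a recording of the given length."""
    return duration_s * fps


def raw_size_gib(frames, height, width, bytes_per_pixel):
    """Uncompressed size of an image stack, in GiB."""
    return frames * height * width * bytes_per_pixel / 2 ** 30


# One hour of video at 25 fps, as in the mp4 question.
mp4_frames = frame_count(60 * 60, 25)

# 40,000 frames of 1024x1024 pixels; 2 bytes per pixel is ASSUMED.
stack_gib = raw_size_gib(40000, 1024, 1024, 2)

print(mp4_frames)  # 90000
print(stack_gib)   # 78.125
```

Numbers like these are why the images-to-series conversion ends up being limited by cluster memory settings rather than by anything specific to the data type.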
mobcdi
@mobcdi
May 23 2016 18:57
What were the cluster structure issues? I'm hoping to spin up a tiny development cluster
Kyle
@kr-hansen
May 23 2016 19:01
It was related to the memory overhead on the Hadoop cluster that my Spark jobs run on
That was for running it on YARN. I was able to solve the issue with the spark-submit setting spark.yarn.executor.memoryOverhead=1024. However, I'm still getting some memory issues related to the shuffle calls going from images to series on my cluster. I don't have a solution to that issue yet but am working on it
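As a sketch of how that setting might be passed, the snippet below just assembles a spark-submit command line using the standard --conf key=value form. The script name my_thunder_job.py and the helper function are hypothetical; only the configuration key and value come from the message above.

```python
import shlex


def spark_submit_command(script, conf):
    """Build a spark-submit argument list with one --conf flag per setting."""
    cmd = ["spark-submit"]
    for key, value in sorted(conf.items()):
        cmd += ["--conf", "{}={}".format(key, value)]
    cmd.append(script)
    return cmd


# "my_thunder_job.py" is a placeholder script name.
cmd = spark_submit_command(
    "my_thunder_job.py",
    {"spark.yarn.executor.memoryOverhead": 1024},
)

# Print the command as it could be typed in a shell.
print(" ".join(shlex.quote(part) for part in cmd))
```

Other options for a particular cluster (master, executor counts and so on) would go on the same command line and are left out here.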
Jeremy Freeman
@freeman-lab
May 23 2016 19:05
@kkcthans ah you might want to look at this thunder-project/thunder#307
short story is that in image.toseries() you might improve performance by changing the size argument to something much smaller like 10
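To see why going from images to series involves a shuffle at all, here is a plain-Python sketch (no Spark, no Thunder) of the regrouping: records keyed by time become records keyed by pixel, so every output record needs a value from every input image. The function name and toy data are made up for illustration, and the sketch does not model how the size argument splits up the work.

```python
def images_to_series(images):
    """Regroup a time-ordered list of 2D frames into per-pixel time series.

    images: list of frames, each a list of rows, all with the same shape.
    Returns a dict mapping (row, col) to the list of values over time.
    """
    series = {}
    for frame in images:
        for r, row in enumerate(frame):
            for c, value in enumerate(row):
                series.setdefault((r, c), []).append(value)
    return series


# Three tiny 2x2 "images" in time order.
frames = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
]

series = images_to_series(frames)
print(series[(0, 0)])  # [1, 5, 9]
print(series[(1, 1)])  # [4, 8, 12]
```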
Kyle
@kr-hansen
May 23 2016 19:06
@freeman-lab It might be beneficial to cover the functions in Base.py & Data.py in the online documentation. I had been using the online docs as my main resource for functions and function calls, but completely missed the methods available through those files (plus(), minus() etc).
@freeman-lab good to know. I'll look at that and try changing the size argument and see if that helps
Jeremy Freeman
@freeman-lab
May 23 2016 19:09
@kkcthans and yes! i've been meaning to fix that, just opened an issue thunder-project/thunder-docs#2
and i saw your comment in the bolt chat about this! really awesome you did it, i want to look more closely at what you've done with @jwittenbach because it might be that we should take what you did and keep it in bolt and then get rid of the implementations currently in thunder
Kyle
@kr-hansen
May 23 2016 19:14
Sounds good. I did pretty much the same thing as what is currently in Thunder, I just used the operator signs for basically the same functions. I actually liked Thunder's approach of making a more generic element_wise function for the mapping. That was cleaner to follow than what I did for bolt, so probably some combination of the two would be best overall
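A minimal sketch of the combination described above: one generic element-wise helper, named methods built on it, and operator signs that forward to those methods. The class and its internals are hypothetical and use plain Python lists; this is not the code in Thunder or Bolt.

```python
import operator


class ToyArray(object):
    """Hypothetical 1D array wrapper, for illustration only."""

    def __init__(self, values):
        self.values = list(values)

    def element_wise(self, other, op):
        """Apply a binary op against another ToyArray or a scalar."""
        if isinstance(other, ToyArray):
            if len(other.values) != len(self.values):
                raise ValueError("shapes do not match")
            return ToyArray(op(a, b) for a, b in zip(self.values, other.values))
        return ToyArray(op(a, other) for a in self.values)

    def plus(self, other):
        return self.element_wise(other, operator.add)

    def minus(self, other):
        return self.element_wise(other, operator.sub)

    # Operator signs reuse the named methods.
    __add__ = plus
    __sub__ = minus


a = ToyArray([1, 2, 3])
b = ToyArray([10, 20, 30])
print((a + b).values)      # [11, 22, 33]
print(a.minus(1).values)   # [0, 1, 2]
```

Keeping the mapping logic in a single helper means each new operation is one line, whichever spelling (method or operator) ends up being the public one.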
Jeremy Freeman
@freeman-lab
May 23 2016 19:14
ok awesome, we'll take a look at what you did and can then follow up in the PR you made to bolt
awesome to have you contributing! :smile:
Kyle
@kr-hansen
May 23 2016 19:15
As it is, it seemed like I wasn't able to directly access the elements I added to Bolt through an Images or Series object
Jeremy Freeman
@freeman-lab
May 23 2016 19:15
ah so yeah there's definitely no automatic propagation of all methods on the underlying array
Kyle
@kr-hansen
May 23 2016 19:15
I'm happy to be contributing. I'm learning as I go, but this is a fun place to start for me at the intersection of a lot of stuff I'm already working on.
Jeremy Freeman
@freeman-lab
May 23 2016 19:16
glad to hear!
so far we've opted to just explicitly add any methods we want to make available
and you can always access the underlying array by doing images.values
Kyle
@kr-hansen
May 23 2016 19:17
Ok, so is images.values basically the same as the underlying BoltArray?
Jeremy Freeman
@freeman-lab
May 23 2016 19:18
yup yup images.values is either a BoltArray or a numpy ndarray depending on whether you're in local or distributed mode
it's a lot like pandas if you've used that
one of the reasons we don't propagate all methods is that we don't have distributed implementations of all the things you can do on a numpy ndarray
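The pattern described here can be sketched as a thin wrapper that exposes only hand-picked methods while leaving the underlying array reachable through .values. Everything below is hypothetical (the class names, the stand-in array, the method names) and is only meant to show the shape of the design, not Thunder's actual code.

```python
class StandInArray(object):
    """Stand-in for the underlying array (a BoltArray or ndarray in the chat)."""

    def __init__(self, data):
        self.data = list(data)

    def map(self, func):
        return StandInArray(func(x) for x in self.data)

    def total(self):
        return sum(self.data)


class ToyImages(object):
    """Wrapper that only exposes methods it explicitly defines."""

    def __init__(self, values):
        self.values = values  # the underlying array stays accessible

    def map(self, func):
        # Explicitly added, so it is available on the wrapper.
        return ToyImages(self.values.map(func))


images = ToyImages(StandInArray([1, 2, 3]))

doubled = images.map(lambda x: 2 * x)   # works: explicitly exposed
has_total = hasattr(images, "total")    # no automatic propagation
total = images.values.total()           # reach through .values instead

print(doubled.values.data)  # [2, 4, 6]
print(has_total)            # False
print(total)                # 6
```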
Kyle
@kr-hansen
May 23 2016 19:20
I haven't used pandas much, but good to know. I'm coming from a Matlab world...
Jeremy Freeman
@freeman-lab
May 23 2016 19:20
ah cool welcome to python
Kyle
@kr-hansen
May 23 2016 19:20
Ya, I was thinking it would be quite a project to create distributed implementations of everything for ndarrays
Jeremy Freeman
@freeman-lab
May 23 2016 19:21
definitely =) for the Spark backend we decided to start with a fairly narrow set of operations
there's a really cool project called dask that's also done a lot of work on this, and we're hoping to eventually have thunder support either dask or spark as the backend
Kyle
@kr-hansen
May 23 2016 19:26
Good to know. Does dask work with Spark, or would all those functions eventually need to be written in bolt/thunder to work in Spark?
Jeremy Freeman
@freeman-lab
May 23 2016 19:26
ah so dask has its own distributed engine
but it'd actually be quite easy because the DaskArray is a lot like the BoltArray
so depending on the engine you'd have an images.values that's either backed by spark (via bolt), by a local numpy array, or by a dask array
that's the plan anyway
Kyle
@kr-hansen
May 23 2016 19:29
Cool. It'd be nice to have the flexibility of both with Thunder down the road. But all those functions would probably still need to be written into bolt for Spark use at some point, right? I just don't want to embark on adding more of the ndarray-like functions to bolt if they could eventually be piped in via something like Dask.
Jeremy Freeman
@freeman-lab
May 23 2016 19:31
oh i see what you're saying
adding at least a few common things to bolt in the short term couldn't hurt
Kyle
@kr-hansen
May 23 2016 19:34
No that makes sense. I was just trying to decide if it'd be worth trying to sit down and do a bunch at once, or just add functionality as I need it myself. I think I'll probably do the latter for the time being, unless I'm feeling I want a big coding project to start up
Jeremy Freeman
@freeman-lab
May 23 2016 19:35
cool yeah i'd probably go with the latter, i usually have more fun coding if i know i'm going to use it