These are chat archives for thunder-project/thunder

19th
Apr 2015
Vijay Mohan K Namboodiri
@vjlbym
Apr 19 2015 02:25
@tomsains Regarding trying to get Thunder to install on Windows, I didn't have any luck. I ran all the Windows equivalents that I could. I am curious how @GrantRVD got it to work. In the meantime, I got it to work on Ubuntu fine. There's still someone in my lab trying to get it to work on Windows. So any help would be great!
Vijay Mohan K Namboodiri
@vjlbym
Apr 19 2015 02:32

Also, could anyone please tell me how the memory allocation is actually done in Thunder? Just a brief primer would help. I tried to run a 400MB dataset locally on a computer with 16GB of RAM. I tried loadImagesAsSeries and it filled up my RAM and started overflowing into the swap! I tried to do the same thing by first loadImages and then converting to series. The loading works fine but the conversion is where the memory usage just explodes. Why is this? I am using a series of .tif files for the images.

Also, when I downsampled the temporal sampling by 3, the data loaded fine as a series (about 5GB of memory used), but then trying to run ICA constantly gave me an error where Java was running out of heap space. Is there a place in the source code where I can set the heap usage limit?

Are these problems simply a result of running Thunder on a local machine? Would running on EC2 eliminate such issues? In any case, it would be great to understand how the memory is actually being handled. Any reference in this regard would be a great help! Thanks a lot!

Grant R. Vousden-Dishington
@GrantRVD
Apr 19 2015 04:59
Admittedly, it took some more finagling with my environment variables and a couple changes to the pyspark executable to make it work, but it wasn't too much effort once I understood what I was looking for. I was able to get thunder to run the examples on the Getting Started section, so it looks like things are in order.
Grant R. Vousden-Dishington
@GrantRVD
Apr 19 2015 05:04
I use Windows Powershell on my Windows machine, so that's what I'd recommend for your labmate, @vjlbym . I can probably help if I have more information on what errors he's getting and what his command line is. Make sure he sets the environment variables necessary.
Vijay Mohan K Namboodiri
@vjlbym
Apr 19 2015 05:05

Could you please let me know what the changes to pyspark were? Did you just use a prebuilt Spark for Hadoop 1.x? Which version did you use?

I think this info could be quite useful for people trying to get it run on Windows. May also be helpful to include this info on the Thunder website. Thanks a lot! :)

Also, I am pretty sure the environment variables were set correctly. The issue is most likely due to Pyspark. I'll get my labmate to post on the forum soon. Thanks again!
Grant R. Vousden-Dishington
@GrantRVD
Apr 19 2015 05:14
@vjlbym Right off the top of my head, it sounds like your OutOfMemoryError is a result of your Java settings, not thunder. First suggestion is to figure out what your max heap size is - I'm told it's 64MB by default on Windows, which could explain your problem, but don't quote me on that. You can use the -Xms and -Xmx arguments to java on the command line to set the initial and max heap sizes. For example, java -Xms512m -Xmx1024m .\<yourapplication> will set your initial and max heap size to 512 MB and 1 GB, respectively. If you want to make the change permanent, you can add those flags to a system-level environment variable called _JAVA_OPTIONS.
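A minimal sketch of the environment-variable route in Python, assuming a HotSpot JVM (which reads _JAVA_OPTIONS at startup) and that the variable is set before the JVM is launched. The helper name heap_flags is hypothetical, and the sizes are just the ones from the example above.

```python
import os


def heap_flags(initial_mb, max_mb):
    """Build the -Xms/-Xmx flag string for the given heap sizes in MB."""
    if initial_mb > max_mb:
        raise ValueError("initial heap size cannot exceed the max heap size")
    return "-Xms{0}m -Xmx{1}m".format(initial_mb, max_mb)


# Only JVMs started after this point, from this process or its children,
# will see the setting. It does not change any JVM that is already running.
flags = heap_flags(512, 1024)
os.environ["_JAVA_OPTIONS"] = flags
print(flags)
```

Set this way, the variable lasts only for the current process and whatever it launches. A system-level variable is still needed to make the change permanent.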
Grant R. Vousden-Dishington
@GrantRVD
Apr 19 2015 05:20
@vjlbym My problems with pyspark stemmed from having both Python 2 and 3 installed, with 3 as my primary. To fix that, I had to change line 63 of pyspark2.cmd to make sure it called ipython with version 2.7 instead of 3.4. Other than that, I installed the same Spark you did, pre-built for Hadoop 1.x, per the thunder installation instructions. The environment variables that seem important to set are SPARK_HOME and PYSPARK_PYTHON (especially if you have multiple Python versions installed), and perhaps others. Your labmate's error message should clear things up.
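A quick way to check those two variables before launching pyspark, as a rough sketch. The helper name missing_vars is hypothetical, and the list covers only the two variables named above. Other variables may matter on a given setup.

```python
import os


def missing_vars(env, required=("SPARK_HOME", "PYSPARK_PYTHON")):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


missing = missing_vars(os.environ)
if missing:
    print("Not set: " + ", ".join(missing))
else:
    # Print the values so a wrong Python interpreter is easy to spot.
    print("SPARK_HOME = " + os.environ["SPARK_HOME"])
    print("PYSPARK_PYTHON = " + os.environ["PYSPARK_PYTHON"])
```

Run it with the same shell and the same Python that will start pyspark, since variables set in one shell session are not visible in another.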
Jeremy Freeman
@freeman-lab
Apr 19 2015 13:52
@vjlbym @GrantRVD thanks for putting your heads together on this! we clearly need to get the Windows installation more straightforward (and better documented, if there are any necessary configs). i'll try to get a hold of some Windows machines for my own testing, but so far i suspect there might be two sets of issues (1) just getting spark and pyspark working on Windows and (2) installing Thunder on Windows. for (1) we can certainly provide advice in our docs, but should ultimately rely on Spark's own docs (and if needed make some changes on the Spark side!). for (2) it may additionally depend on whether you do a pip install or clone the repo, and if you can create a github issue with a full description of what you did and what the error was we can try to fix!
Jeremy Freeman
@freeman-lab
Apr 19 2015 13:59
@vjlbym regarding your memory questions above: the short story is that much of this is due to running things locally. there are a number of ways in which local use is sub-optimal, for both performance and memory reasons, and most people find a very non-linear improvement in performance when going from a local machine to even a small cluster. @GrantRVD 's suggestions should help with the java heap errors in the short term, and we can try to debug other memory issues. we are also working on an entirely new set of abstractions that should make Thunder as performant as possible when running locally, while keeping the same API across both local and cluster usage. if that works, I hope many of these problems will go away!
Grant R. Vousden-Dishington
@GrantRVD
Apr 19 2015 17:50
Thanks @freeman-lab. I agree with your main points. A lot of the problems, Windows or otherwise, in getting set up seem to come from the Spark installation and getting pyspark up and running. One part of the problem is the lack of any pip support for installing pyspark, so there are a lot of places the manual installation can go awry. I'm still having trouble getting thunder to run from the command-line, because Windows doesn't recognize it as an executable or python code (the thunder file in \bin has no file extension). If I run pyspark and then from thunder import ThunderContext to assign tsc, everything works fine, but it's a bit roundabout. It might be necessary to add a short .exe file to the bin to make things usable for Windows.
The good news regarding the Python 2 vs. 3 problem is that just a few days ago, pyspark was ported to Python 3 (apache/spark#5173), so this opens up the possibility of making thunder compatible with Python 2 and 3. I'm not sure what changes would need to be made to accommodate the changes to pyspark, but if help is needed, I'm happy to serve as the Python 3 and Windows-user test subject. Just let me know what I need to do.
Vijay Mohan K Namboodiri
@vjlbym
Apr 19 2015 17:57

Thanks a lot, @GrantRVD and @freeman-lab. I'll get my labmate to post his issue on Github soon, most likely tomorrow. If/when it works on Windows, I can make a list of steps that we did, in case it is helpful for others in the future. We only have Python 2.7 installed through Anaconda.

As for the Java error, I set an environment variable in Ubuntu as a fix. But now it seems that I am getting another error that I don't understand. I am posting the full log as an issue on Github here: thunder-project/thunder#172

Again, thanks a lot for your help and for making Thunder :)

@GrantRVD That's curious: I didn't try running pyspark and then from thunder import ThunderContext . I only tried to run thunder as a command and then realized that it isn't an executable. I'll try this method and post here soon. Thanks again!
Grant R. Vousden-Dishington
@GrantRVD
Apr 19 2015 18:02

@vjlbym Give it a shot. Specifically, cd your command line to the pyspark directory and run .\pyspark.cmd. Then, once you successfully get to the IPython prompt, use

from thunder import ThunderContext
tsc = ThunderContext(sc)

Then try to run the ICA example from the thunder homepage. If that completes successfully then I'd say you're about ready to go.

Richard A Hofer
@rhofour
Apr 19 2015 18:07
@freeman-lab Is there a reason that the Data object isn't in the documentation?