These are chat archives for thunder-project/thunder

6th
Mar 2018
Boaz Mohar
@boazmohar
Mar 06 2018 02:23
@mellertd As far as I understand, yarn mode differs in how you deploy your Spark cluster; at Janelia it is in standalone mode. As far as thunder is concerned, they are both the same. You would need a Jupyter notebook and a way to get a Spark context in it, which might be different than how it is done at Janelia. Is this the part you need help with?
mellertd
@mellertd
Mar 06 2018 03:34
I guess the question had more to do with the virtual environment issue. I recall at Janelia, Thunder and dependencies were installed on all the nodes, so you could launch standalone clusters and everything worked. For various reasons, we don’t want to maintain installs on all the nodes and would rather let users manage their own environments. This is possible with yarn mode, but it is rather clunky and is not officially supported in interactive mode (i.e. in Jupyter)
I am just wondering if anyone had any experience with this, because there seem to be many degrees of freedom to get things working well
Boaz Mohar
@boazmohar
Mar 06 2018 13:20
I have used a virtual environment with thunder. You should change the environment variable PYSPARK_PYTHON, and both the driver and the workers will see the same Python virtual environment.
mellertd
@mellertd
Mar 06 2018 13:34
It is actually much more complicated than that
This is what we are currently trying:
Hm, this chat is unusable on iOS Safari, I’ll paste a link when I can get to my laptop
Our tests seem to work with virtualenv, but it is very slow. Haven't gotten it to work with conda yet, though it shouldn't behave any differently. I am currently trying to figure out how we might speed things up
Boaz Mohar
@boazmohar
Mar 06 2018 14:42
I am definitely not an expert here, and have not used any of these with spark-submit. But for interactive mode I have used this code to make sure I am in the right environment on both the driver and the workers:
import numpy
# numpy as seen by the driver
print(numpy.__file__)

def test1(x):
    # runs on the workers: report which numpy they import
    import numpy
    return numpy.__file__

data = sc.range(10).map(test1).collect()
# numpy as seen by the workers -- should match the driver's
print(data[0])
and PYSPARK_PYTHON worked by pointing it to the virtual environment from conda: export PYSPARK_PYTHON=/groups/svoboda/home/moharb/anaconda2/envs/py35/bin/python
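The same idea can be done from inside Python, as long as the variable is set before the SparkContext is created. A minimal sketch, assuming a hypothetical conda env path (adjust to your cluster):

```python
import os

# Hypothetical interpreter path inside a conda env -- replace with your own.
env_python = "/path/to/conda/envs/py35/bin/python"

# Must be set BEFORE the SparkContext is constructed: the driver reads it
# when launching executors, so both sides end up on the same interpreter.
os.environ["PYSPARK_PYTHON"] = env_python
os.environ["PYSPARK_DRIVER_PYTHON"] = env_python

# from pyspark import SparkContext
# sc = SparkContext()  # executors now run env_python
```

This only works for interpreters that already exist on every node (e.g. a shared filesystem, as with the export above); it does not ship the environment to the workers.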