These are chat archives for thunder-project/thunder

16th
Jul 2015
alexandrelaborde
@AlexandreLaborde
Jul 16 2015 12:13

Hello everyone.
My name is Alexandre Laborde and I am a Computer Science student currently working at the Champalimaud Foundation, a neuroscience research facility in Portugal.

I am trying to adapt the thunder project to run in our private cluster and I would like to get your input if possible.

Correct me if I’m wrong, but from what I can see in your code, basically the only components that have to be modified are the functions that call the Amazon (EC2) cloud servers, so that they work with our servers instead, plus setting up a Spark cluster on the machines we have here.

I read in the FAQ section of the GitHub repo that you are planning to write a how-to on making thunder run on private clusters. Is it out already and I missed it? If not, can you please give some guidelines on how to do this?

Needless to say, I will gladly contribute to the project on GitHub.

Best regards,
Alexandre Laborde

andrew giessel
@andrewgiessel
Jul 16 2015 12:24
good morning @AlexandreLaborde !
I’m sure @freeman-lab will have more to say, but I wanted to point you to this repo: https://github.com/freeman-lab/spark-janelia
I think it is how they use spark on their clusters at janelia farm. Jeremy can clarify if not
it might be a good place to start to think about local deployment
alexandrelaborde
@AlexandreLaborde
Jul 16 2015 12:25
Hello :smile:
thank you for responding
i will look into that
andrew giessel
@andrewgiessel
Jul 16 2015 12:26
great, and if it’s not a good fit, feel free to create an issue in the main thunder repo
alexandrelaborde
@AlexandreLaborde
Jul 16 2015 12:27
ok thank you
andrew giessel
@andrewgiessel
Jul 16 2015 12:29
:thumbsup:
alexandrelaborde
@AlexandreLaborde
Jul 16 2015 12:31
:+1:
if I can do something that other people can use i'll add it to thunder
Jeremy Freeman
@freeman-lab
Jul 16 2015 13:08
thanks for the interest @AlexandreLaborde ! can you say a little more what you mean by 'private cluster'?
alexandrelaborde
@AlexandreLaborde
Jul 16 2015 14:12
We already have 16 servers which the investigators can submit their jobs to.
Some researchers here found out about thunder and want to use it. So, I was asked if thunder could be modified to run here instead of on Amazon Servers
Jeremy Freeman
@freeman-lab
Jul 16 2015 14:15
what software stack is currently running on those 16 servers?
put another way, how are the researchers submitting their jobs to it?
maybe it's running sun grid engine + qsub? (a lot of academic / university clusters are)
alexandrelaborde
@AlexandreLaborde
Jul 16 2015 14:17
as far as I know they are using grid
Jeremy Freeman
@freeman-lab
Jul 16 2015 14:18
ok great
so first off, this is really a question of setting up Spark (which Thunder runs on); once Spark is set up on the cluster, adding Thunder is easy
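(for context: once Spark itself is in place, putting thunder on a node is typically just a pip install; a minimal sketch, using the package name published on PyPI)

```bash
# install the thunder Python package on a node (run on each node that will execute tasks)
pip install thunder-python
```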
alexandrelaborde
@AlexandreLaborde
Jul 16 2015 14:19
I have the green light to install spark in all computers
ok that's great news :)
Jeremy Freeman
@freeman-lab
Jul 16 2015 14:20
but there's a question of how this will integrate with the existing submission
in what's called "standalone mode", described here http://spark.apache.org/docs/latest/spark-standalone.html
you basically set up one node as master and the rest as workers
and start spark, at which point spark jobs (including thunder jobs) can be submitted to the cluster
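concretely, the standalone setup boils down to roughly this (just a sketch; the hostname is a placeholder and the exact script arguments depend on your Spark version, so check the standalone docs above):

```bash
# on the node chosen as master
$SPARK_HOME/sbin/start-master.sh

# on each worker node, registering with the master's URL (default port 7077)
$SPARK_HOME/sbin/start-slave.sh spark://master-host:7077
```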
but while the spark cluster is running, in this mode, other users shouldn't be submitting jobs
alexandrelaborde
@AlexandreLaborde
Jul 16 2015 14:22
I didn't realize that when I read this a few days ago
Jeremy Freeman
@freeman-lab
Jul 16 2015 14:22
there are other deployment options for spark better suited to multi-user environments, like mesos http://spark.apache.org/docs/latest/running-on-mesos.html
when you deploy on EC2 the entire cluster is for your spark jobs until you shut it down
andrew giessel
@andrewgiessel
Jul 16 2015 14:23
@freeman-lab for clarification, the reason that others shouldn’t run jobs at the same time is due to resource limits?
Jeremy Freeman
@freeman-lab
Jul 16 2015 14:24
yeah basically, but that's just the default, e.g. this sentence "By default, it will acquire all cores in the cluster, which only makes sense if you just run one application at a time. "
you can configure limited resources for the spark cluster so that other jobs can run at the same time
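for example, something along these lines (the numbers are placeholders, not recommendations):

```bash
# conf/spark-env.sh on each worker node: cap what the Spark worker may take
SPARK_WORKER_CORES=4      # leave the remaining cores for other grid jobs on the node
SPARK_WORKER_MEMORY=8g    # memory ceiling for Spark executors on this node
# per application, spark.cores.max in conf/spark-defaults.conf caps how many
# cores a single Spark job grabs across the whole cluster
```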
but in my experience systems like grid engine are used to having access to all workers on all nodes
so if spark is running on a portion of cores on a few nodes, the grid engine scheduler needs to "know that"
that's why at janelia we ultimately set it up so a spark job is spun up on a subset of nodes, and once deployed those nodes are unavailable to the grid engine
alexandrelaborde
@AlexandreLaborde
Jul 16 2015 14:27
from what I have seen of the platform that the researchers use, they have to select the computers where they want to run their code from a list of available machines
so I don't believe there is any master allocating resources as they are needed
but I have to look into that
Jeremy Freeman
@freeman-lab
Jul 16 2015 14:29
in that case it might be straightforward to install spark on all of them, have a script that launches the spark cluster on a subset, and then while it's running just make sure that subset is not available to researchers
not totally clear to me who/what manages which nodes are available (usually that would be the grid engine)
more generally, i'm working on documenting all of this more fully, and would love to get your feedback once that's written!
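purely as a hypothetical sketch of such a launcher (hostnames and the install path are placeholders, and it assumes passwordless ssh between the nodes):

```bash
#!/usr/bin/env bash
# sketch: bring up a Spark standalone cluster on a chosen subset of nodes
# usage: ./launch-spark.sh master-host worker-host-1 [worker-host-2 ...]
SPARK_HOME=/opt/spark        # assumed install location on every node
MASTER=$1
shift
ssh "$MASTER" "$SPARK_HOME/sbin/start-master.sh"
for WORKER in "$@"; do
  # each worker registers with the master it is pointed at
  ssh "$WORKER" "$SPARK_HOME/sbin/start-slave.sh spark://$MASTER:7077"
done
echo "Spark master running at spark://$MASTER:7077"
```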
alexandrelaborde
@AlexandreLaborde
Jul 16 2015 14:30
I am almost sure this system only runs one job per computer, since they only have 2 GB of RAM each
Jeremy Freeman
@freeman-lab
Jul 16 2015 14:31
oh that's very little RAM!
alexandrelaborde
@AlexandreLaborde
Jul 16 2015 14:31
very little indeed
Jeremy Freeman
@freeman-lab
Jul 16 2015 14:31
have you or your colleagues considered using EC2?
andrew giessel
@andrewgiessel
Jul 16 2015 14:31
+1 for EC2
Jeremy Freeman
@freeman-lab
Jul 16 2015 14:31
we're also working on integration with Google's cloud services, which will provide another deployment option
andrew giessel
@andrewgiessel
Jul 16 2015 14:32
costs can be as low as $0.20/hour/machine with spot instances
alexandrelaborde
@AlexandreLaborde
Jul 16 2015 14:32
I don't think they know that these machines are so weak
I will talk to them
I actually believe you know who they are, @freeman-lab, since they know you
Jeremy Freeman
@freeman-lab
Jul 16 2015 14:33
oh nice =)
alexandrelaborde
@AlexandreLaborde
Jul 16 2015 14:34
The group leader is Michael Orger, does that ring any bells?
Jeremy Freeman
@freeman-lab
Jul 16 2015 14:37
oh of course
well I'd be happy to hop on a Skype call with you / him / his team to discuss this stuff
alexandrelaborde
@AlexandreLaborde
Jul 16 2015 14:37
if they decide to run this here anyway, how deep do I have to dig into the thunder code?
that would be great :) I will talk to him about that
Jeremy Freeman
@freeman-lab
Jul 16 2015 14:38
setting up spark is really independent of thunder
alexandrelaborde
@AlexandreLaborde
Jul 16 2015 14:38
agreed
Jeremy Freeman
@freeman-lab
Jul 16 2015 14:38
you'd basically follow the instructions here http://spark.apache.org/docs/latest/spark-standalone.html
which doesn't involve the thunder code base at all
alexandrelaborde
@AlexandreLaborde
Jul 16 2015 14:39
so I only have to create my own spark-janelia
is that it?
Jeremy Freeman
@freeman-lab
Jul 16 2015 14:40
well, not exactly, spark-janelia is a set of wrapper scripts
alexandrelaborde
@AlexandreLaborde
Jul 16 2015 14:40
maybe not the best metaphor
Jeremy Freeman
@freeman-lab
Jul 16 2015 14:40
but kind of!
the core action is happening in a set of scripts we wrote that get run on the sun grid engine to basically call the sequence of sbin/start-master.sh, sbin/start-slave.sh, etc
alexandrelaborde
@AlexandreLaborde
Jul 16 2015 14:41
I meant the part that connects thunder to the spark running on the servers here
Jeremy Freeman
@freeman-lab
Jul 16 2015 14:41
very annoyingly i wasn't allowed to open source those scripts because of dumb proprietary grid engine stuff
but it should be easy to reproduce just by following those instructions on the spark deployment page
oh, once the spark cluster is set up you can run thunder jobs just by calling the thunder and thunder-submit executables and specifying the IP address of the master
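as a sketch (the master hostname is a placeholder, and exact flags can differ between Thunder/Spark versions):

```bash
# interactive thunder shell against the standalone cluster
thunder --master spark://master-host:7077

# submit a batch job to the same cluster (analysis.py is your own script)
thunder-submit --master spark://master-host:7077 analysis.py
```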
alexandrelaborde
@AlexandreLaborde
Jul 16 2015 14:43
ok I will try that and in a few days I will return here to share my experiences
Jeremy Freeman
@freeman-lab
Jul 16 2015 14:43
ok terrific!
thanks
alexandrelaborde
@AlexandreLaborde
Jul 16 2015 14:44
no, thank you :)