These are chat archives for nextflow-io/nextflow

17th
Mar 2016
Robert Syme
@robsyme
Mar 17 2016 08:14
Oh wow. I was not expecting that Apache Ignite would do automatic topology detection. I just added another node mid-workflow and now everything goes faster. Amazing!!!
Paolo Di Tommaso
@pditommaso
Mar 17 2016 08:16
:)
I would suggest a tweet on that :)
Robert Syme
@robsyme
Mar 17 2016 08:16
:) Will do
Does it do the same in reverse (if a non-master node goes down)?
Paolo Di Tommaso
@pditommaso
Mar 17 2016 08:17
yep
Robert Syme
@robsyme
Mar 17 2016 08:17
I'm a bit scared to test it now that I'm running the workflow and spending real money
Fantastic!
Paolo Di Tommaso
@pditommaso
Mar 17 2016 08:18
currently working on scheduler work-stealing, so that new nodes can steal pending jobs from other nodes
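For readers unfamiliar with the term, here is a minimal Python sketch of the general work-stealing idea: an idle node takes a pending job from the node with the longest queue. The node names, the `steal` helper, and the "steal from the tail" policy are assumptions for illustration only; this is not Nextflow's or Ignite's scheduler code.

```python
from collections import deque

def steal(queues, idle):
    """Move one pending job from the longest queue to the idle node's queue."""
    victim = max(queues, key=lambda name: len(queues[name]))
    if victim == idle or not queues[victim]:
        return None                      # nothing worth stealing
    job = queues[victim].pop()           # take from the tail of the victim's queue
    queues[idle].append(job)
    return job

# Hypothetical cluster: node2 has just joined and has no work yet.
queues = {"node1": deque(["a", "b", "c"]), "node2": deque()}
stolen = steal(queues, "node2")
```

After the call, `node2` holds one of the jobs that was previously waiting on `node1`.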
Robert Syme
@robsyme
Mar 17 2016 08:18
Last year I wrote these ridiculous Ansible scripts to get dynamic-resizing SLURM clusters running on AWS. It was a bloody nightmare. Apache Ignite (as part of NF) is so far ahead.
Yeah, looking forward to seeing nextflow-io/nextflow@f4e6c0e in master
Paolo Di Tommaso
@pditommaso
Mar 17 2016 08:19
not so far, I need to polish some things
are you launching it on AWS ?
Robert Syme
@robsyme
Mar 17 2016 08:20
No rush - I have a habit of breaking jobs into small pieces and then reassembling at the other end anyway.
The current job is running on GCE
Paolo Di Tommaso
@pditommaso
Mar 17 2016 08:20
ah, cool
the real goal is to have a dynamic resizable cluster depending on execution metrics
Robert Syme
@robsyme
Mar 17 2016 08:21
I was watching the log and saw "[Topology water]" go by and thought to myself "No, surely not...". Launched a new node and less than a minute later, it was running jobs.
Paolo Di Tommaso
@pditommaso
Mar 17 2016 08:22
Yes, Ignite has fantastic features
just to know, does GCE support tcp multicast?
Robert Syme
@robsyme
Mar 17 2016 08:22
No, it doesn't look like it :(
Paolo Di Tommaso
@pditommaso
Mar 17 2016 08:23
I see, cloud providers don't like it
Robert Syme
@robsyme
Mar 17 2016 08:23
Had to do -cluster.join path
Paolo Di Tommaso
@pditommaso
Mar 17 2016 08:23
yep
Robert Syme
@robsyme
Mar 17 2016 08:23
... but everybody has to have NFS anyway, really.
No biggie
(or gluster, whatever)
Paolo Di Tommaso
@pditommaso
Mar 17 2016 08:23
wait, how do you share files in GCE?
Robert Syme
@robsyme
Mar 17 2016 08:24
Shared file system over NFS
Paolo Di Tommaso
@pditommaso
Mar 17 2016 08:24
Is it provided by GCE or have you installed it?
Robert Syme
@robsyme
Mar 17 2016 08:25
Installed it (with ansible). NFS on most modern distros is pretty painless, really.
Paolo Di Tommaso
@pditommaso
Mar 17 2016 08:25
OK
Robert Syme
@robsyme
Mar 17 2016 08:26
How will you do dynamic resizable clusters? Have launcher scripts for a handful of the big cloud providers?
I suppose there are boto/jclouds libraries that make it a little easier.
Paolo Di Tommaso
@pditommaso
Mar 17 2016 08:28
On amazon the idea is to use the Elastic load balancing service to launch new instances on demand
no, w/o using fat libraries
Robert Syme
@robsyme
Mar 17 2016 08:29
gotcha
Paolo Di Tommaso
@pditommaso
Mar 17 2016 08:30
you just need to say you need fewer/more instances than the ones already running
the nextflow stack is designed to be easy enough to be configured with a few lines of bash
the only problem is the shared file system
on AWS I'm waiting for EFS
in the meantime nextflow uses an S3 bucket to share the data
Robert Syme
@robsyme
Mar 17 2016 08:38
On GCE, I suppose you can just take a snapshot of your worker node and then use that to create an autoscaling instance group. As long as NFS works for the first one, it will work for the others.
Paolo Di Tommaso
@pditommaso
Mar 17 2016 08:40
have you ever tried this
I mean in place of NFS?
Robert Syme
@robsyme
Mar 17 2016 08:42
I saw that, but I suspect that NFS or glusterFS will have better performance. Note that I haven't actually tested this theory...
Using buckets would certainly be better for very large data sources or when you're not sure about how large the work directory is going to be.
Paolo Di Tommaso
@pditommaso
Mar 17 2016 08:44
Gluster surely, but not so sure about NFS.
In my tests it performs horribly with a real medium/large workload
Robert Syme
@robsyme
Mar 17 2016 08:45
Ah. I will test this next time :)
Paolo Di Tommaso
@pditommaso
Mar 17 2016 08:45
the problem is that it is not backed by real storage, so you will end up having n-1 clients reading/writing to the same node
Robert Syme
@robsyme
Mar 17 2016 08:51
I've just read that you can add gcsfuse mounts to /etc/fstab which means that newly spun up instances can have connected file systems after boot.
Paolo Di Tommaso
@pditommaso
Mar 17 2016 08:52
yep, I was just wondering if you had some experience with it. I would like to test GCE in the near future
Robert Syme
@robsyme
Mar 17 2016 08:58
I'll certainly use gcsfuse next time. One benefit of using a bucket is that I could kill all the nodes as soon as the job is done without having to worry about rsyncing the results back.
Paolo Di Tommaso
@pditommaso
Mar 17 2016 08:59
makes sense
in that case I would suggest using process.scratch=true in your nextflow scripts
Robert Syme
@robsyme
Mar 17 2016 09:00
Can that go in ~/.nextflow/config?
Paolo Di Tommaso
@pditommaso
Mar 17 2016 09:00
sure
Jason Byars
@jbyars
Mar 17 2016 19:53
if process.scratch=true is used, does a pipeline still need to be launched from a folder shared with all cluster workers? I.E. does that directive simply specify the folder for the worker to run in, or does that trigger a copy of all job related scripts to the scratch folder on the worker?
Paolo Di Tommaso
@pditommaso
Mar 17 2016 20:32
@jbyars Yes, a shared folder is needed. The difference is that the computation is executed on the node's local storage, and the outputs are automatically copied to the shared directory
however, depending on the file system, inputs are not copied but just symlinked into the local task directory
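The behaviour described above can be pictured with a small Python sketch: run the task in a node-local temporary directory, symlink the inputs into it, and copy only the outputs back to the shared directory. The `run_in_scratch` function and its parameters are hypothetical names made up for illustration; this is not how Nextflow implements `process.scratch`, just the same idea under simplifying assumptions (a POSIX file system that supports symlinks, outputs named up front).

```python
import shutil
import tempfile
from pathlib import Path

def run_in_scratch(shared_dir, inputs, task, outputs):
    """Run task(workdir) in a node-local temp dir, then copy outputs to shared_dir."""
    shared_dir = Path(shared_dir)
    with tempfile.TemporaryDirectory() as scratch:
        scratch = Path(scratch)
        # Inputs are linked rather than copied, so large files stay where they are.
        for src in inputs:
            (scratch / Path(src).name).symlink_to(Path(src).resolve())
        # The computation reads and writes only inside the local scratch directory.
        task(scratch)
        # Only the declared outputs travel back to the shared directory.
        for name in outputs:
            shutil.copy2(scratch / name, shared_dir / name)
```

Here `task` is any callable that receives the scratch directory; the scratch directory is removed once the outputs have been copied.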
Jason Byars
@jbyars
Mar 17 2016 20:46
Ok, I can work with that. The other part of my scheme relies on cfncluster. Does the beforeScript directive just run on the master node, or does it run on all the workers as well? Right now I'm thinking in these terms: input defines all possible files to be processed, and when decides which files actually need to be processed. In the beforeScript directive it would be nice to look at how many items passed when and issue a cluster resize. Is this possible?
Paolo Di Tommaso
@pditommaso
Mar 17 2016 20:50
you may think of beforeScript as a prolog of your command script
thus it runs on all nodes
however I'm not sure you can manage to resize the cluster in the way you are proposing
Jason Byars
@jbyars
Mar 17 2016 20:51
Probably not.
Paolo Di Tommaso
@pditommaso
Mar 17 2016 20:52
however it sounds interesting, cluster resizing is something I'm planning to work on in the near future
Jason Byars
@jbyars
Mar 17 2016 20:52
But, I think this clears up enough I can simplify my design.
Paolo Di Tommaso
@pditommaso
Mar 17 2016 20:53
Does cfncluster deploy any kind of cluster engine such as slurm or sge ?
Jason Byars
@jbyars
Mar 17 2016 20:53
yes slurm, sge, pbs, etc. on AWS.
Paolo Di Tommaso
@pditommaso
Mar 17 2016 20:54
ah
is it also able to resize the cluster dynamically?
Jason Byars
@jbyars
Mar 17 2016 20:54
and the auto resizing works
the only catch I've found is if torque goes down to 0 workers, it can't recover
Paolo Di Tommaso
@pditommaso
Mar 17 2016 20:55
sounds interesting, I didn't know that
Jason Byars
@jbyars
Mar 17 2016 20:55
Yep, and they've finally updated the available list of images, so you don't have to run on CentOS 6
Paolo Di Tommaso
@pditommaso
Mar 17 2016 20:56
what is controlling the resizing? In an ideal world it should be automatic depending on the node metrics (CPU usage, etc.)
Jason Byars
@jbyars
Mar 17 2016 20:58
It's all integrated with CloudWatch. I don't want to butcher the explanation, but the idea is that the job queue is monitored. If the queue has more than x items for y time, a resize rule is triggered and the cluster grows
Paolo Di Tommaso
@pditommaso
Mar 17 2016 20:59
Well, if so it should work out of the box. You don't need any special logic in nextflow to manage that
Jason Byars
@jbyars
Mar 17 2016 21:01
right, I don't have to have special logic. But, the growth rules are not based on a multiple of queue size
It's more like if Queue is larger than 5, add 2 nodes, wait 2 minutes and see how many jobs are queued, repeat
So if I know I'm adding 100 jobs, and my scale up rules are conservative, it can take a while for the cluster to grow to an appropriate size.
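To get a feel for why a conservative step-wise rule is slow to catch up, here is a toy Python simulation. The threshold (5 queued jobs), step (2 nodes) and wait (2 minutes) come from the message above; everything else (one job per node at a time, a fixed 10-minute job length, one starting node, no node boot delay, the function name) is an assumption for illustration and not cfncluster's actual logic.

```python
def minutes_to_drain(jobs, nodes=1, threshold=5, step=2, check_every=2, job_minutes=10):
    """Simulate a step-wise scale-up rule; return (total minutes, final node count).

    Assumes each node runs one job at a time and every job takes job_minutes.
    """
    queued = jobs
    running = []          # remaining minutes of each running job
    minute = 0
    while queued or running:
        # Idle nodes pick up queued jobs.
        while queued and len(running) < nodes:
            running.append(job_minutes)
            queued -= 1
        # The scale-up rule is evaluated once per check interval.
        if minute % check_every == 0 and queued > threshold:
            nodes += step
        # Advance the clock by one minute and drop finished jobs.
        minute += 1
        running = [m - 1 for m in running if m > 1]
    return minute, nodes
```

Comparing `minutes_to_drain(100)` with `minutes_to_drain(100, nodes=100)` shows the gap between growing two nodes at a time and resizing up front when the job count is already known.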
Paolo Di Tommaso
@pditommaso
Mar 17 2016 21:03
I see
Jason Byars
@jbyars
Mar 17 2016 21:03
You can issue resize commands to deal with these situations.
Paolo Di Tommaso
@pditommaso
Mar 17 2016 21:03
interesting
I guess, some cfncluster command to be executed in the master node, right?
Jason Byars
@jbyars
Mar 17 2016 21:05
Actually I haven't tested that. Normally, you're running the cfncluster commands on a different host. It builds, manages, and destroys the cluster for you by setting up rules in AWS.
Paolo Di Tommaso
@pditommaso
Mar 17 2016 21:07
Does it also install a shared file system?
Jason Byars
@jbyars
Mar 17 2016 21:07
It sets up both shared and ephemeral local disk. It can also setup bucket access.
Paolo Di Tommaso
@pditommaso
Mar 17 2016 21:10
I will have a look at the documentation. It could be interesting to have proper support for cfncluster autoscaling in nextflow
Jason Byars
@jbyars
Mar 17 2016 21:11
This is what I'm trying to finish building: Jenkins to monitor for work. Jenkins spawns cfnclusters as needed and makes them pull a nextflow repo of all my generic workflows. Then Jenkins launches the nextflow workflows based on what work it finds. Clusters shrink when work is done and I don't run up a crazy bill.
I'm not sure you'll have to add anything, but you should have a look. It might save you some work.
Jason Byars
@jbyars
Mar 17 2016 21:24
Conceptually, I've never figured out how to move those jobs onto a head node. Everything I come up with always needs an always on, or almost always on external node to decide when there is work to be done. I suppose I could implement a sophisticated set of Lambda functions, but that would bind me to AWS.
Paolo Di Tommaso
@pditommaso
Mar 17 2016 21:26
I need to leave now, but I would like to continue this discussion tomorrow if you are available
Jason Byars
@jbyars
Mar 17 2016 21:27
Don't worry I'll be around whenever I can.
Jason Byars
@jbyars
Mar 17 2016 22:55
Should moreFiles = Channel.fromPath( 'data/**/.fa' ) from the example on the Channels page be moreFiles = Channel.fromPath( 'data/**/*.fa' )? It looks to me like moreFiles = Channel.fromPath( 'data/**/.fa' ) would only emit files named .fa in sub-folders of data.
Paolo Di Tommaso
@pditommaso
Mar 17 2016 23:35
Yes, that's a typo. Thanks for pointing it out
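The difference between the two patterns can be checked with a quick Python sketch. It uses `pathlib` globbing as a stand-in, so it only approximates how Nextflow's `Channel.fromPath` resolves globs; the directory layout and file names are made up for the demonstration.

```python
import tempfile
from pathlib import Path

def matches(root, pattern):
    """Return the sorted relative paths under root that match a glob pattern."""
    return sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout: one ordinary FASTA file and one file literally named ".fa".
    (root / "data" / "sub").mkdir(parents=True)
    (root / "data" / "sub" / "sample.fa").touch()
    (root / "data" / "sub" / ".fa").touch()

    literal = matches(root, "data/**/.fa")     # pattern as printed in the docs
    wildcard = matches(root, "data/**/*.fa")   # corrected pattern
```

With this layout, `literal` contains only the file named `.fa`, while `wildcard` picks up `sample.fa` as well.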