These are chat archives for nextflow-io/nextflow

25th May 2017
amacbride
@amacbride
May 25 2017 03:10
Oooh, I found an interesting bug, and I didn't see anything in the issues list about it. nextflow-io/nextflow#349
@pditommaso I'll change my UUID format to work around it, but it probably represents a class of possible bugs.
Paolo Di Tommaso
@pditommaso
May 25 2017 05:45
@amacbride LOL, I like this :) thanks for reporting the issue. I will provide a patch soon.
Robert Syme
@robsyme
May 25 2017 07:11

Hi all. I'm looking to run nextflow run . -with-mpi on a slurm cluster. My submission scripts look like this:

#!/bin/bash --login
#SBATCH --job-name=nf-chickpea
#SBATCH --time=2:00:00
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16

module load java
# random seed shared by all nodes so they join the same nextflow cluster
export NXF_CLUSTER_SEED=$(shuf -i 0-16777216 -n 1)
srun nextflow run -resume . -with-mpi

But the slurm output suggests that nextflow is expecting the OMPI_COMM_WORLD_RANK environment variable:

Missing `$OMPI_COMM_WORLD_RANK` variable -- it looks you are not running in a MPI environment
Missing `$OMPI_COMM_WORLD_RANK` variable -- it looks you are not running in a MPI environment
srun: error: z012: task 0: Exited with exit code 1
srun: error: z013: task 1: Exited with exit code 1

Are the "Missing $OMPI_COMM_WORLD_RANK variable" errors from nextflow?

Paolo Di Tommaso
@pditommaso
May 25 2017 07:15
Can't check now, try to grep for that variable in the nextflow launcher script
Let me know
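A quick way to do that check, assuming the nextflow command on your PATH is the bash launcher script:

# search the launcher for the MPI rank variable
grep -n 'OMPI_COMM_WORLD_RANK' "$(which nextflow)"
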
Robert Syme
@robsyme
May 25 2017 07:16
Ah, good idea.
Yeah, it is. I'll find out what variables are available when using srun. Probably won't be too hard to add in an extra test. I'll talk to the HPC sysadmins and get back to you with a pull request (or an issue, at the very least).
It looks like an if/else split on whether nf finds itself on the master node or not.
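A rough sketch of that split in shell terms (hypothetical; whether SLURM_PROCID is the right srun counterpart of the OpenMPI rank variable still needs checking):

# fall back to slurm's global task rank when the OpenMPI variable is absent
rank="${OMPI_COMM_WORLD_RANK:-$SLURM_PROCID}"
if [ -z "$rank" ]; then
    echo "no rank variable found -- not running under MPI/srun" >&2
elif [ "$rank" = "0" ]; then
    echo "rank 0: launch the nextflow master here"
else
    echo "rank $rank: launch a worker daemon here"
fi
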
Paolo Di Tommaso
@pditommaso
May 25 2017 07:22
Yes exactly, that variable reports the index of the current node in your allocation
It's used by NF to distinguish the master instance from the workers
Try to figure out why it's missing and/or whether your cluster has an alternative variable for that
Robert Syme
@robsyme
May 25 2017 07:27
I /think/ it is $SLURM_LOCALID, but will check
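For the record, SLURM_LOCALID is the node-local rank, while SLURM_PROCID is the global task rank, so the latter is probably the closer match for OMPI_COMM_WORLD_RANK. One way to see what srun actually sets for each task:

srun --ntasks=2 bash -c 'echo "procid=$SLURM_PROCID localid=$SLURM_LOCALID nodeid=$SLURM_NODEID"'
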
Paolo Di Tommaso
@pditommaso
May 25 2017 07:31
Wait you need to use mpirun to launch NF not drink
Sub
Oops
Instead of srun to run the wrapper
Paolo Di Tommaso
@pditommaso
May 25 2017 07:39
*drink = srun
Hope you understood I'm not drunk, just using the mobile :)
Robert Syme
@robsyme
May 25 2017 07:41
Yeah, 10am seems too early for a drink (and I'm Australian!)
Unfortunately, OpenMPI isn't installed - only the SGI Message Passing Toolkit :(
Would you expect that, even if I find a way to test for 'master node-ness', there will be other problems that crop up because of the lack of OpenMPI?
Robert Syme
@robsyme
May 25 2017 07:48
Don't worry - nextflow is easy to compile - so I'll just try a bunch of stuff and see. Will report back.
Paolo Di Tommaso
@pditommaso
May 25 2017 07:53
Meeting
Robert Syme
@robsyme
May 25 2017 07:59
It works! I've got to head off to journal club, but I'll send a PR and you can have a look.
Robert Syme
@robsyme
May 25 2017 09:39
nextflow-io/nextflow#350
Karin Lagesen
@karinlag
May 25 2017 10:17
when I run on a slurm cluster, which file(s) in the work dir should I look at to check if my sbatch params are as I want them to be?
Paolo Di Tommaso
@pditommaso
May 25 2017 10:21
The top of the .command.run
Karin Lagesen
@karinlag
May 25 2017 10:22
hm, figured it out, .command.run
:)
but: sbatch -c, what does that one do...?
I am used to using --ntasks with slurm
although I do think slurm is quite confusing regarding ntasks, cpus, ntasks-per-node, etc.
Paolo Di Tommaso
@pditommaso
May 25 2017 10:23
Can't check now, but instead of using command-line options NF uses the equivalent slurm directives
Look for the #SBATCH (if I'm not wrong) specifications at the top of that file
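For example, from the pipeline's launch directory:

# print the scheduler directives generated for every task so far
grep '^#SBATCH' work/*/*/.command.run
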
Karin Lagesen
@karinlag
May 25 2017 10:25
that's what I am looking at
it sends in #SBATCH -c
Robert Syme
@robsyme
May 25 2017 10:27
-c is --cpus-per-task
The --cpus-per-task option is there to ensure that each 'task' gets enough CPUs on the same machine. The sbatch manual page gives an example where you have a job with 4 tasks, each requiring 3 CPUs - 12 CPUs in total. Without stipulating --cpus-per-task 3, the scheduler would think "this person requires 12 CPUs, so I'll give them 3 quad-core nodes".
If you have four tasks, at least one of the tasks is then split across two nodes, which is not ideal. If you specify --cpus-per-task 3, the slurm scheduler will allocate you four quad-core nodes. One of the CPUs on each node will be wasted, but at least each of your tasks gets 3 CPUs on the same node.
Hope that makes sense.
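In directive form, the manual's example looks roughly like this:

# asking for 12 CPUs as 12 one-CPU tasks: may be packed onto 3 quad-core nodes
#SBATCH --ntasks=12

# asking for 4 tasks of 3 CPUs each: grants 4 nodes, one task per node
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=3
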
Karin Lagesen
@karinlag
May 25 2017 10:36
I think so....
but... hm
ok, my cluster has nodes with 16 CPUs each
it seems that if I specify a $task.cpus greater than that, it gets denied on the cluster
sbatch: error: Batch job submission failed: Requested node configuration is not available
Robert Syme
@robsyme
May 25 2017 10:37
Exactly - SLURM assumes that you absolutely have to fit each task on a single node.
Karin Lagesen
@karinlag
May 25 2017 10:38
yeah, but my cluster doesn't demand that
I am running spades right now, and I have run that with --ntasks of 32, etc.
Robert Syme
@robsyme
May 25 2017 10:38
Each of those 32 tasks needs to fit on a single node.
Karin Lagesen
@karinlag
May 25 2017 10:39
I believe that for our cluster one task gets one cpu
that is at least how I've been interpreting the cluster manual
16 should suffice, since I can just run multiple jobs in parallel instead, but I wouldn't mind boosting the count
Robert Syme
@robsyme
May 25 2017 10:40
I think that's right, yeah. If you don't specify --cpus-per-task, the controller will just allocate one cpu per task.
Karin Lagesen
@karinlag
May 25 2017 10:43
hm
if I understand things correctly, I have now hit my first "not too happy about" thing re NF :smile:
Robert Syme
@robsyme
May 25 2017 10:44
Oh, this isn't a problem with NF - this is SLURM being strange, I think (note: I am not a NF dev).
Well, SLURM is designed for people that have very specific requirements about how their job is partitioned across a cluster. We bioinformatics people are usually more likely to just ask "Give me 32 cores, I don't really care where they come from".
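The two styles of request, as sbatch directives:

# "give me 32 CPUs, I don't care where": 32 one-CPU tasks, possibly spread across nodes
#SBATCH --ntasks=32

# "give me 32 CPUs on one machine": a single multi-threaded task
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
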
Karin Lagesen
@karinlag
May 25 2017 10:47
yeah, but at the moment I can't really understand how I can get NF to do exactly that
Paolo Di Tommaso
@pditommaso
May 25 2017 10:52
Not sure I understand: if your nodes have 16 CPUs each, why are you asking for 32?
(for a single job)
Karin Lagesen
@karinlag
May 25 2017 11:15
because spades doesn't care which node it runs on
it can run things on several nodes
Paolo Di Tommaso
@pditommaso
May 25 2017 11:17
Ah, this changes things
If so, you need to manage the CPU settings by using the generic clusterOptions setting
and specifying the proper slurm option
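A minimal sketch of what that could look like inside a process definition (the process name and the option value here are illustrative). Since clusterOptions is a per-process directive, it can also differ from process to process:

process spades {
    // pass the raw slurm option through instead of the cpus directive
    clusterOptions '--ntasks=32'

    """
    echo "allocated tasks: \$SLURM_NTASKS"
    """
}
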
Karin Lagesen
@karinlag
May 25 2017 12:29
Can I have process specific cluster options?
Paolo Di Tommaso
@pditommaso
May 25 2017 12:31
during the same run?
Karin Lagesen
@karinlag
May 25 2017 18:38
yes
and, btw, I am getting a really odd error msg

ERROR ~ Unable to parse config file: '/work/projects/nn9305k/software/Bifrost/nextflow.config'

Cannot get property 'l' on null object

Paolo Di Tommaso
@pditommaso
May 25 2017 20:34
there's something wrong in your config file.
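That Groovy message usually means the config dereferences a property on something that is null; a hypothetical example that would produce it:

// nextflow.config: if params.queue was never defined it evaluates to null,
// so reading its 'l' property fails with "Cannot get property 'l' on null object"
process.clusterOptions = "--partition=${params.queue.l}"
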