These are chat archives for nextflow-io/nextflow

14th
Jun 2016
Andrew Stewart
@astewart-twist
Jun 14 2016 00:47
Hey, is something up with the get.nextflow.io url?
I keep getting "curl: (22) The requested URL returned error: 403 Forbidden"
Robert Syme
@robsyme
Jun 14 2016 02:12
@astewart-twist Looks ok from here...
Andrew Stewart
@astewart-twist
Jun 14 2016 03:22
Looks like my organization's firewall is classifying it as malware
is there an alternative installation method?
Paolo Di Tommaso
@pditommaso
Jun 14 2016 03:54
uh, that's bad
@astewart-twist yes, download it from github
but, am I wrong or have you had a similar problem in the past?
Hugues Fontenelle
@huguesfontenelle
Jun 14 2016 07:56

Good morning!
My nextflow.config file contains some settings for our SLURM cluster:

process {
    executor='slurm'

    $ss_mapping {
        clusterOptions = "--account=hugues --time=10:00:00 --mem-per-cpu=3140 --cpus-per-task=6"
    }
}

Now, in the pipeline process itself, I'd like to implement a retry error strategy, e.g.:

memory { 2.GB * task.attempt }
cpus { 5 + 2 * task.attempt }
time { 10.hour * 0.5 * task.attempt }

How does memory relate to --mem-per-cpu?
How do I retrieve time, --mem-per-cpu and --cpus-per-task as variables to use in the pipeline?
Will the values specified in the pipeline overwrite those defined in the config file?
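
A minimal sketch of how those dynamic directives could be combined with a retry error strategy (the process name, resource figures and script body are illustrative, not taken from the chat):

process ss_mapping {
    // resources scale with the retry attempt
    memory { 2.GB * task.attempt }
    cpus { 5 + 2 * task.attempt }
    time { 10.hour * 0.5 * task.attempt }

    // resubmit the task with the scaled resources when it fails
    errorStrategy 'retry'
    maxRetries 3

    script:
    """
    echo "attempt ${task.attempt}"
    """
}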

Hugues Fontenelle
@huguesfontenelle
Jun 14 2016 08:02
Also, perhaps a single snippet of code would answer all my questions at once :)
Thank you
Paolo Di Tommaso
@pditommaso
Jun 14 2016 08:35
@huguesfontenelle Good morning
The memory directive is converted to a SLURM --mem option
Actually I don't know how --mem-per-cpu relates to --mem, you will need to check the SLURM docs
but I suspect they are in conflict
however, two things:
first, you can do the same with clusterOptions, I mean in the config file you can put something like
clusterOptions = { "--account=hugues --time=10:00:00 --mem-per-cpu=${task.attempt * 3140} --cpus-per-task=6" }
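
For context, a sketch of how that closure could sit in the config shown earlier; because it is a closure, clusterOptions is re-evaluated with the current task.attempt on every retry:

process {
    executor='slurm'

    $ss_mapping {
        clusterOptions = { "--account=hugues --time=10:00:00 --mem-per-cpu=${task.attempt * 3140} --cpus-per-task=6" }
    }
}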
Paolo Di Tommaso
@pditommaso
Jun 14 2016 08:40
second, what's the benefit of using mem-per-cpu vs per-job memory (--mem) in your use case?
Hugues Fontenelle
@huguesfontenelle
Jun 14 2016 08:44
Each CPU has 3140 MB on our cluster. If I request, say, --mem=15Gb --cpus-per-task=2, I'll be billed for 5 CPUs even though I requested only 2. From a cost-saving perspective I'm better off maxing out the mem-per-cpu and using cpus-per-task as the variable to play with
So I'll be adding cpus { 5 + 2 * task.attempt } ..
Good idea to specify this in the config file :)
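
A quick back-of-the-envelope check of that billing rule, as a standalone Groovy sketch (the 3140 MB per CPU comes from the chat; the 1 GB = 1024 MB rounding is an assumption):

// illustrative only: how many CPUs a 15 GB request is billed as
int memPerCpuMb = 3140        // memory available per CPU on the cluster
int requestedMemMb = 15 * 1024   // a 15 GB memory request
int requestedCpus = 2
int billedCpus = Math.ceil(requestedMemMb / (double) memPerCpuMb) as int
assert billedCpus == 5           // billed for 5 CPUs although only 2 were requested
assert billedCpus > requestedCpus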
Paolo Di Tommaso
@pditommaso
Jun 14 2016 08:47
makes sense ..
Hugues Fontenelle
@huguesfontenelle
Jun 14 2016 08:49
and I'll try to fit the time better, so I'm scheduled faster. Not a problem right now, but if the cluster gets busy, I'd like the patient diagnostics not to be delayed too much
Paolo Di Tommaso
@pditommaso
Jun 14 2016 08:50
:+1:
Hugues Fontenelle
@huguesfontenelle
Jun 14 2016 08:51

from the slurm doc:

--mem and --mem-per-cpu are mutually exclusive.

I had expected that

Paolo Di Tommaso
@pditommaso
Jun 14 2016 08:53
good to know
Hugues Fontenelle
@huguesfontenelle
Jun 14 2016 08:54
Thanks a lot
Paolo Di Tommaso
@pditommaso
Jun 14 2016 08:55
welcome
Hugues Fontenelle
@huguesfontenelle
Jun 14 2016 12:34
I deliberately gave too little time (one minute) to a read mapping process (bwa-mem).
Apparently that did not trigger a failure. Instead the next process started, and failed because the BAM files were malformed.
What I wanted was the mapping to fail and restart with, say, twice the time: time { 2.h * task.attempt }
Paolo Di Tommaso
@pditommaso
Jun 14 2016 12:54
!
too bad
but this means that slurm didn't return a non-zero exit status; nextflow can do little about that
Hugues Fontenelle
@huguesfontenelle
Jun 14 2016 13:19
For a successful SLURM job, I get
Jobstate=COMPLETED ExitCode=0:0
while for a timeout I get
Jobstate=TIMEOUT ExitCode=0:1
(along with a lot more info)
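
For reference, a sketch of the commands that produce this kind of information, assuming standard SLURM tooling (the job id is a placeholder):

# the scheduler's record of a finished job
scontrol show job 123456 | grep -Eo 'JobState=[^ ]+|ExitCode=[^ ]+'
# a more compact view of the same fields from the accounting database
sacct -j 123456 --format=JobID,State,ExitCode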
Paolo Di Tommaso
@pditommaso
Jun 14 2016 13:21
ugly..
Hugues Fontenelle
@huguesfontenelle
Jun 14 2016 13:22
I don't know how slurm is supposed to behave, just reading the output of scontrol show job 123456 here, after submitting it with sbatch
Is the 0:1 colon-separated exit code not standard?
Paolo Di Tommaso
@pditommaso
Jun 14 2016 13:23
no, that is not the problem
because nextflow reads the task exit status from a file named .exitcode that is created by the job wrapper
what should happen is that when there's a timeout event, slurm sends a termination signal to that job
this stops the task execution, reporting a non-zero exit status
thus, it's really weird that it exits with a 0
Hugues Fontenelle
@huguesfontenelle
Jun 14 2016 13:31
So I should contact the people behind the cluster and ask them why my timed-out jobs are not writing a non-zero code in .exitcode?
Paolo Di Tommaso
@pditommaso
Jun 14 2016 13:32
I would suggest that. Is it expected behaviour that when killing a job slurm reports a zero exit status?
Phil Ewels
@ewels
Jun 14 2016 13:37
If it helps, our SLURM cluster emits 143 when jobs are killed due to running out of memory or time
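
A minimal sketch of how that 143 could drive a retry, assuming a Nextflow version that supports a dynamic errorStrategy (process name and script body are illustrative):

process mapping {
    // double the time limit on each attempt
    time { 2.h * task.attempt }

    // 143 = 128 + SIGTERM (15), i.e. the job was killed by the scheduler
    errorStrategy { task.exitStatus == 143 ? 'retry' : 'terminate' }
    maxRetries 2

    script:
    """
    bwa mem ref.fa reads.fq > aln.sam
    """
}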
Paolo Di Tommaso
@pditommaso
Jun 14 2016 13:38
thanks for the feedback, could it be related to some specific job or even cluster configuration?
Hugues Fontenelle
@huguesfontenelle
Jun 14 2016 13:41
thank you.
any idea how to locate this .exitcode file ?
(I'm a bit new to slurm, etc)
Phil Ewels
@ewels
Jun 14 2016 13:42
It's in the work directory, but as it begins with a . it will typically be hidden on unix command lines
ls -a to list files including hidden files
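
For example, a sketch of inspecting a task directory by hand (the two-level hash below is a placeholder; the hash nextflow prints next to each submitted process corresponds to this directory):

cd work/ab/0123456789abcdef   # placeholder task work directory
ls -a                         # .command.run, .command.sh, .exitcode, ...
cat .exitcode                 # the exit status nextflow recorded for the task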
Paolo Di Tommaso
@pditommaso
Jun 14 2016 13:42
exactly, that's a file generated by nextflow, not by slurm
Hugues Fontenelle
@huguesfontenelle
Jun 14 2016 13:42
oh, it's a nextflow file?
I didn't understand that
thought it was slurm
I was just launching slurm jobs without nextflow
Paolo Di Tommaso
@pditommaso
Jun 14 2016 13:44
ah, so sysadmins are your friends :)
Hugues Fontenelle
@huguesfontenelle
Jun 14 2016 13:53

OK I wrote and submitted this:

ch = Channel.from(3)

process timeout {
    time "${v}s"
    echo true

    input:
    val v from ch

    script:
    """
    echo "${v} seconds"
    sleep 10
    """
}

The .exitcode file is empty

Paolo Di Tommaso
@pditommaso
Jun 14 2016 13:54
empty, i.e. zero bytes?
sure? try to wait some seconds
Hugues Fontenelle
@huguesfontenelle
Jun 14 2016 13:54
wait
1 byte
Paolo Di Tommaso
@pditommaso
Jun 14 2016 13:55
ok
Hugues Fontenelle
@huguesfontenelle
Jun 14 2016 13:55
I cat it
Paolo Di Tommaso
@pditommaso
Jun 14 2016 13:55
... and ...
Hugues Fontenelle
@huguesfontenelle
Jun 14 2016 13:56
sorry my bad
Paolo Di Tommaso
@pditommaso
Jun 14 2016 13:56
:)
Hugues Fontenelle
@huguesfontenelle
Jun 14 2016 13:56
0
Paolo Di Tommaso
@pditommaso
Jun 14 2016 13:56
yep, shared file systems do that
you can try to do this
move in that process working directory
then run
NXF_DEBUG=3 sbatch .command.run
send me the output when done
Hugues Fontenelle
@huguesfontenelle
Jun 14 2016 13:58
account specification required.. I'll need some time here..
Hugues Fontenelle
@huguesfontenelle
Jun 14 2016 14:08
well, I do get 143 ..
Hugues Fontenelle
@huguesfontenelle
Jun 14 2016 14:15
now that I check, the samples that I have in production (and that ran out of time) also exit with 143
But when I changed the time limit to the absolute minimum allowed by SLURM, in order to test my code, it did not exit that way.