These are chat archives for nextflow-io/nextflow

6th
Jun 2017
Shellfishgene
@Shellfishgene
Jun 06 2017 12:28
Does nextflow know a task was killed because it ran out of time?
Shellfishgene
@Shellfishgene
Jun 06 2017 13:02
@ewels In your RNAseq workflow you use a conda environment, but load this only before running nextflow, right? will the jobs on the cluster then also have this loaded? Or is that just the case with your scheduler? (SLURM I guess)
Paolo Di Tommaso
@pditommaso
Jun 06 2017 13:03
nope, this is generally delegated to the batch scheduler
thus when a task is killed, for NF it is just an execution error
Shellfishgene
@Shellfishgene
Jun 06 2017 13:04
@pditommaso So when I use retry with more cpus/ram options, it will also resubmit that if it's actually a program error?
Paolo Di Tommaso
@pditommaso
Jun 06 2017 13:05
generally when a task is killed by the scheduler it returns a specific exit code (e.g. 140 with SGE)
you can use that info to reschedule your task properly, see here
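A minimal sketch of what such a rescheduling rule could look like in `nextflow.config`, assuming the scheduler reports 140 for exceeded limits (as mentioned for SGE). The base memory and time values and the retry count are illustrative assumptions, not values from this conversation:

```groovy
// nextflow.config (sketch; resource values are placeholders)
process {
    // Retry only when the scheduler killed the task (exit status 140 on SGE);
    // any other failure is treated as a real error and stops the run.
    errorStrategy = { task.exitStatus == 140 ? 'retry' : 'terminate' }
    maxRetries    = 3

    // Ask for more resources on each attempt (task.attempt starts at 1).
    memory = { 2.GB * task.attempt }
    time   = { 1.h * task.attempt }
}
```

With a rule like this, a genuine program error (any other exit status) is not resubmitted with larger resources.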
Shellfishgene
@Shellfishgene
Jun 06 2017 13:16
Thanks!
Paolo Di Tommaso
@pditommaso
Jun 06 2017 13:17
+1
Shellfishgene
@Shellfishgene
Jun 06 2017 13:27
Hmm, what's an easy way to find out the error code our scheduler uses for limits exceeded? I can't find it in the documentation...
Paolo Di Tommaso
@pditommaso
Jun 06 2017 13:29
Specify a timeout of 1 minute, create a job sleeping for 5 minutes, submit and.. wait :)
Shellfishgene
@Shellfishgene
Jun 06 2017 13:30
yes, but where do I see the error code? I just get stdout and stderr text files...
Paolo Di Tommaso
@pditommaso
Jun 06 2017 13:32
If you use NF it will be printed in the error message, otherwise you need to capture the exit status somehow
Shellfishgene
@Shellfishgene
Jun 06 2017 13:43
Hmm, I created a 3 min sleep job with a 1 min queue limit and used NF to submit. The job is already killed, but NF is still waiting for it?
Paolo Di Tommaso
@pditommaso
Jun 06 2017 13:44
Weird, move into the task work directory
What's the content of the .exitcode file?
Shellfishgene
@Shellfishgene
Jun 06 2017 13:46
It's not there
But there is a .command.log with the error from the scheduler.
Actually...
#!/usr/bin/env nextflow

process sleep {
    """
    sleep 120
    """
}
Does this even work?
Paolo Di Tommaso
@pditommaso
Jun 06 2017 13:49
that should not happen, there are two possible reasons: 1) the task is still running or 2) it has been killed the hard way by your cluster, so there's no way to capture the exit status (and NF will stop after some minutes)
what batch scheduler are you using ?
Does this even work?
yes sure
Shellfishgene
@Shellfishgene
Jun 06 2017 13:55
I tried again, and there's no .exitcode file. That means it must be option 2. I'm using NQSII, which I wrote the executor for.
%NQSII(INFO): Batch job received signal SIGKILL. (Exceeded per-req elapse time limit)
By 'killed the hard way' you mean SIGKILL vs SIGTERM?
Paolo Di Tommaso
@pditommaso
Jun 06 2017 13:58
ah, yes :)
Shellfishgene
@Shellfishgene
Jun 06 2017 13:59
I guess making NF check for that line in the stderr would take a lot of additional code...
Paolo Di Tommaso
@pditommaso
Jun 06 2017 14:00
I don't know about NQSII but being a branch of SGE I would expect that it works in a similar manner
Shellfishgene
@Shellfishgene
Jun 06 2017 14:00
PBS apparently sends sigterm first and then sigkill
Paolo Di Tommaso
@pditommaso
Jun 06 2017 14:00
SGE, when it needs to kill a job, first sends a SIGUSR1 signal (soft kill)
exactly, then after a few seconds it sends SIGKILL (hard kill)
Shellfishgene
@Shellfishgene
Jun 06 2017 14:01
Our PBSII qdel sends just SIGKILL by default, you have to add option -g to make it send SIGTERM first.
Paolo Di Tommaso
@pditommaso
Jun 06 2017 14:02
ah, perfect
Shellfishgene
@Shellfishgene
Jun 06 2017 14:02
But I guess I can't change what the scheduler does when the time is exceeded
Paolo Di Tommaso
@pditommaso
Jun 06 2017 14:03
I think you will need to add that parameter in the qsub cmd line
Shellfishgene
@Shellfishgene
Jun 06 2017 14:04
But that won't change the behaviour of the scheduler? Or does NF kill the job before the time runs out?
Paolo Di Tommaso
@pditommaso
Jun 06 2017 14:04
ah wait
you are right, qsub is not involved in that, there should be an equivalent option on the qsub command line
Shellfishgene
@Shellfishgene
Jun 06 2017 14:05
Of course there isn't, I checked. Maybe it's an option the administrator needs to change.
Paolo Di Tommaso
@pditommaso
Jun 06 2017 14:06
likely
Shellfishgene
@Shellfishgene
Jun 06 2017 14:06
So can I change the errorStrategy { task.exitStatus == 140 ? 'retry' : 'terminate' } line to retry only when there is an unknown error code? I guess that would work.
Paolo Di Tommaso
@pditommaso
Jun 06 2017 14:08
Yes, try using Integer.MAX_VALUE for that
Shellfishgene
@Shellfishgene
Jun 06 2017 14:12
Replace 140 by Integer.MAX_VALUE?
Paolo Di Tommaso
@pditommaso
Jun 06 2017 14:13
Yes
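A sketch of the resulting rule, assuming (as suggested above) that a task killed without leaving an exit status shows up with `task.exitStatus` equal to `Integer.MAX_VALUE`. The retry count is an illustrative assumption:

```groovy
// nextflow.config (sketch)
process {
    // Integer.MAX_VALUE stands for "no exit status could be read", e.g. after
    // a SIGKILL from the scheduler. Retry in that case, terminate otherwise.
    errorStrategy = { task.exitStatus == Integer.MAX_VALUE ? 'retry' : 'terminate' }
    maxRetries    = 2   // placeholder value
}
```

To cover schedulers that do report a dedicated code as well, the condition could be widened, for example to `task.exitStatus in [140, Integer.MAX_VALUE]`.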
Shellfishgene
@Shellfishgene
Jun 06 2017 14:29
[7c/2f7e45] Submitted process > sleep
WARN: Process `sleep` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (1)
[33/6178a2] Re-submitted process > sleep
:)
Paolo Di Tommaso
@pditommaso
Jun 06 2017 14:30
Great!
Shellfishgene
@Shellfishgene
Jun 06 2017 14:31
It just takes quite long to notice the task is killed...
Paolo Di Tommaso
@pditommaso
Jun 06 2017 14:32
Yes, you should sort out the SIGTERM signal with your cluster to have NF react faster
Shellfishgene
@Shellfishgene
Jun 06 2017 14:32
I can't change the polling interval?
Paolo Di Tommaso
@pditommaso
Jun 06 2017 14:35
Yes, check the documentation, executor config settings
But if you set it too low you can get false-positive errors, depending on your file system
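For illustration, the executor settings in question could be tuned roughly like this in `nextflow.config`. The durations are placeholder assumptions; values that are too short can produce the false positives mentioned above on slow shared file systems:

```groovy
// nextflow.config (sketch; durations are placeholders)
executor {
    // How often NF checks for task termination.
    pollInterval      = '15 sec'
    // How often the queue status is fetched from the scheduler.
    queueStatInterval = '30 sec'
    // How long to wait for a missing or empty exit file before the task
    // is reported as failed.
    exitReadTimeout   = '120 sec'
}
```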
Phil Ewels
@ewels
Jun 06 2017 15:39
@Shellfishgene - as said above, software is generally handled separately. RNA pipeline works with environment modules by default with SLURM, or conda / native cmd line, or docker.
Which one depends on the profile used when running
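A rough sketch of how such profiles might be laid out in `nextflow.config`. The profile names, module name, conda environment name and container image are hypothetical placeholders and are not taken from the RNA-seq pipeline itself:

```groovy
// nextflow.config (sketch; every name below is a hypothetical placeholder)
profiles {
    modules {
        // Load software through environment modules on the cluster nodes.
        process.module = 'fastqc'
    }
    conda_env {
        // Activate a conda environment at the start of every task.
        process.beforeScript = 'source activate rnaseq-env'
    }
    docker {
        docker.enabled    = true
        process.container = 'example/rnaseq:latest'
    }
}
```

A profile is then selected at launch time with `-profile`, for example `nextflow run main.nf -profile modules`.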
Shellfishgene
@Shellfishgene
Jun 06 2017 15:42
@ewels I have stuff installed via conda, but don't have it in the module system. I'm having some trouble getting the scheduler to export the correct PATH on the nodes, but that's only a local problem then I guess.
Félix C. Morency
@fmorency
Jun 06 2017 18:00
or singularity :D
Karin Lagesen
@karinlag
Jun 06 2017 21:16
huh
seems that
Channel
    .fromFilePairs( params.reads, size:params.setsize )
    .ifEmpty { error "Cannot find any reads matching: ${params.reads}" }
    .into{ fastqc_reads, read_pairs }
doesn't work?
Félix C. Morency
@fmorency
Jun 06 2017 21:18
You need a ; in the .into{}, not a comma
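With that change applied, the snippet would read roughly as follows. This assumes the Nextflow syntax in use at the time of this conversation, and `params.reads` and `params.setsize` are the parameters from the message above:

```groovy
// Split the read pairs into two channels, one per downstream process.
// Inside the .into closure the target channel names are separated by ';'.
Channel
    .fromFilePairs( params.reads, size: params.setsize )
    .ifEmpty { error "Cannot find any reads matching: ${params.reads}" }
    .into { fastqc_reads; read_pairs }
```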
Paolo Di Tommaso
@pditommaso
Jun 06 2017 21:18
exactly
Karin Lagesen
@karinlag
Jun 06 2017 21:18
ah :)
thanks!