These are chat archives for nextflow-io/nextflow

12th Dec 2018
Stephen Kelly
@stevekm
Dec 12 2018 00:11
@rsuchecki yes thanks so much!

when you use the -bg option it creates a .nextflow.pid file

@pditommaso are there any side-effects on Nextflow's execution behavior with this argument? Or is the .nextflow.pid file creation its only function?
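
For context, a minimal sketch of what that pid file makes possible (assuming it just holds the PID of the backgrounded run; main.nf and run.log are placeholders):

# start the run detached from the terminal; the launcher writes .nextflow.pid
nextflow -bg run main.nf > run.log

# later: check whether the run is still alive, or stop it
kill -0 "$(cat .nextflow.pid)" && echo "still running"
kill "$(cat .nextflow.pid)"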

Paolo Di Tommaso
@pditommaso
Dec 12 2018 07:24
@aunderwo I think it happens because it tries to update the spot prices; I'll add an explicit info message
Paolo Di Tommaso
@pditommaso
Dec 12 2018 07:34
@t-neumann it may be this nextflow-io/nextflow#856
Alexander Peltzer
@apeltzer
Dec 12 2018 08:24
It's currently not possible to add CLI options such as -ansi to your own ~/.nextflow/config?
Paolo Di Tommaso
@pditommaso
Dec 12 2018 08:26
nope, that will become the default at some point (soon)
Tobias Neumann
@t-neumann
Dec 12 2018 09:17
@aunderwo I'm running on on-demand instances, so anything spot-related should not be an issue
@pditommaso I was actually running this on 50 test samples, 2 fastq files each + 3*4 index files, so the sheer number of files should be quite manageable.
Alexander Peltzer
@apeltzer
Dec 12 2018 09:36
Ok nice :-)
Anthony Underwood
@aunderwo
Dec 12 2018 09:39
@pditommaso and @t-neumann I see this fairly regularly, and based on @t-neumann's comment it's not a spot pricing thing. @pditommaso, what would be the best way to debug what Nextflow is waiting for during the warm-up stage when using the awsbatch executor?
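In the meantime I guess I can poke at the run log myself, something like this (assuming -log plus a grep is a sensible first step; main.nf and the awsbatch profile name are just placeholders):

# write the run log to an explicit file, then see what the executor is doing
# (the grep pattern is just a guess at useful keywords)
nextflow -log awsbatch-debug.log run main.nf -profile awsbatch
grep -iE 'batch|submit|executor' awsbatch-debug.log | tail -n 50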
Hugues Fontenelle
@huguesfontenelle
Dec 12 2018 10:54
I see that right now, while running locally. All processes have exit code 0, yet the Task monitor reports all processes as active (as in the snapshot), and the workflow does not complete.
Hugues Fontenelle
@huguesfontenelle
Dec 12 2018 11:03
(I see this "No more task to compute -- The following nodes are still active" message after I press Ctrl-C. Before that it just hangs.)
Anthony Underwood
@aunderwo
Dec 12 2018 11:09
@huguesfontenelle I see the same output too when it hangs
Paolo Di Tommaso
@pditommaso
Dec 12 2018 11:35
use jstack <pid> and open an issue with the printed stack trace
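e.g. something along these lines (assuming the run was started with -bg so .nextflow.pid exists; otherwise find the java process first):

# dump the thread stacks of the running Nextflow JVM
jstack "$(cat .nextflow.pid)" > nextflow-threads.txt

# for a foreground run, locate the PID first
jps -l | grep -i nextflow
jstack <pid> > nextflow-threads.txt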
Tobias Neumann
@t-neumann
Dec 12 2018 12:00

@aunderwo @pditommaso any suggestions on how one could debug this, or where to look?

Dec-12 11:32:52.410 [Task monitor] DEBUG n.processor.TaskPollingMonitor - No more task to compute -- The following nodes are still active:
[process] centrifugePaVE
  status=ACTIVE
  port 0: (queue) closed; channel: -
  port 1: (value) bound ; channel: index
  port 2: (cntrl) -     ; channel: $

[process] centrifugeRefSeq
  status=ACTIVE
  port 0: (queue) closed; channel: -
  port 1: (value) bound ; channel: index
  port 2: (cntrl) -     ; channel: $

[process] centrifugeENA
  status=ACTIVE
  port 0: (queue) closed; channel: -
  port 1: (value) bound ; channel: index
  port 2: (cntrl) -     ; channel: $

No job or log ends up on the AWS end, so I'm a bit lost on where to even start.

Paolo Di Tommaso
@pditommaso
Dec 12 2018 12:31
does it run correctly w/o batch ?
Tobias Neumann
@t-neumann
Dec 12 2018 13:47

it runs correctly with the slurm profile
What I have just checked, and this is very odd: when I define the containers like this:

process {
    // Process-specific docker containers
    withName:centrifugePaVE {
        container = 'docker://obenauflab/virusintegration:latest'
    }

    withName:centrifugeRefSeq {
        container = 'docker://obenauflab/virusintegration:latest'
    }
    withName:centrifugeENA {
        container = 'docker://obenauflab/virusintegration:latest'
    }
}

they get submitted to AWS Batch and then crash there.

CannotPullContainerError: API error (400): invalid reference format

so I figured the problem is the same one I had last time: I have to get rid of the docker:// prefix
but as soon as I do that, I run into the problem that the queue is stuck and nothing gets submitted to AWS Batch
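
i.e. something like this (same process names and image as above, just without the docker:// prefix):

process {
    // Process-specific docker containers
    withName:centrifugePaVE {
        container = 'obenauflab/virusintegration:latest'
    }
    withName:centrifugeRefSeq {
        container = 'obenauflab/virusintegration:latest'
    }
    withName:centrifugeENA {
        container = 'obenauflab/virusintegration:latest'
    }
}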

can you think of anything that could cause this?

Stephen Ficklin
@spficklin
Dec 12 2018 16:17
Hi, hopefully this is a quick question. I'm getting this error when running my Nextflow script and I am at a loss as to what is causing it:
No signature of method: users() is applicable for argument types: (ConfigObject) values: [[:]]
Anyone with any thoughts?
I'm not calling a users() function in my nextflow script.
Paolo Di Tommaso
@pditommaso
Dec 12 2018 16:18
maybe users { .. something .. }
Stephen Ficklin
@spficklin
Dec 12 2018 16:20
Yes, you're right. I feel stupid now.
Paolo Di Tommaso
@pditommaso
Dec 12 2018 16:20
shit happens ;)
Stephen Ficklin
@spficklin
Dec 12 2018 16:20
I had a setting in my config file that was the cause.
Thanks for listening.
Paolo Di Tommaso
@pditommaso
Dec 12 2018 16:20
welcome
Stephen Kelly
@stevekm
Dec 12 2018 19:55

@stevekm it just runs NF in the background https://github.com/nextflow-io/nextflow/blob/0f9baf6d577e6d8c1872cee6713682d41a8e5693/nextflow#L164-L168

this is kinda confusing because if Nextflow is running in the background, then why does SLURM wait for it to finish after I submit it to run inside an sbatch job? I thought you had to run something like wait <pid> to get that behavior

Stephen Kelly
@stevekm
Dec 12 2018 20:02

Also I am getting a lot of errors like this on our SLURM cluster:

ERROR ~ Error executing process > 'eval_pair_vcf (SampleID.chr3.MuTect2)'

Caused by:
  Failed to submit process to grid scheduler for execution

Command executed:

  sbatch .command.run

Command exit status:
  1

Command output:
  sbatch: error: Batch job submission failed: Socket timed out on send/recv operation

When I check the trace for that task it looks like this:

$ paste <(nheadt trace.txt) <(grep 'b0/3c7' trace.txt | t - )
     1    task_id    1944
     2    hash    b0/3c75ea
     3    native_id    -
     4    process    eval_pair_vcf
     5    tag    SampleID.chr3.MuTect2
     6    name    eval_pair_vcf
     7    status    (SampleID.chr3.MuTect2)
     8    exit    FAILED

No native_id means Nextflow thinks the job was never submitted to SLURM, right?

However when I check the work directory, there are results and logs that show that the task actually did get submitted and run on SLURM.

So I am wondering how Nextflow could report this error if the job actually did get submitted and ran successfully (it even has a .exitcode file with contents '0').

Any ideas? I am getting a lot of these errors lately and pipelines are crashing due to timeouts from the SLURM scheduler. Even running 'sinfo' commands in the terminal sometimes gives similar errors. Maybe Nextflow is not waiting long enough for the scheduler to respond? I saw the exitReadTimeout config option listed here but was not sure whether it controls the amount of time Nextflow waits for SLURM to respond to 'sbatch' or not.
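
For context, a sketch of the executor scope options that look relevant here (option names as per the Nextflow docs; the values are placeholders, not recommendations):

executor {
    queueSize         = 50        // max number of jobs the executor handles in parallel
    pollInterval      = '30 sec'  // how often to poll for task termination
    queueStatInterval = '5 min'   // how often to fetch the cluster queue status
    exitReadTimeout   = '900 sec' // how long to wait for the .exitcode file once a job looks terminated
    submitRateLimit   = '50/2min' // throttle submissions: at most 50 sbatch calls every 2 minutes
}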

Tobias "Tobi" Schraink
@tobsecret
Dec 12 2018 21:54
Yeah, been having similar issues on the cluster @stevekm. Stopped having these when I reduced the queue size to 200 and set min/max heap sizes for the JVM
But I have been getting loads of ConcurrentModificationException messages lately in my workflow, which at this point just attempts to download genomes, and I have not been able to figure out what the problem is there.
Stephen Kelly
@stevekm
Dec 12 2018 22:22
@tobsecret are you running this on Big Purple? My queue size is 50 but I have been running multiple pipelines at once
how are you modifying the JVM?
Tobias "Tobi" Schraink
@tobsecret
Dec 12 2018 23:10
Like so:
NXF_OPTS='-Xms512m -Xmx2G' nextflow run pipeline.nf
Yes, I am running these on Big Purple - not in the last couple of days, but I have been over the last couple of weeks
Stephen Kelly
@stevekm
Dec 12 2018 23:22
Thanks, I will give that a shot. How does that affect Nextflow though?
Tobias "Tobi" Schraink
@tobsecret
Dec 12 2018 23:27
It sets the maximum/minimum heap size, i.e. how much RAM Nextflow uses. I had to do this when I was running a very large workflow and it would get killed by a cron job. It might not fix your problem though - it seems like your issue is more related to how quickly the SLURM scheduler can report back on the progress of a job
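Something like this inside an sbatch wrapper, for example (a hypothetical script - job resources and the profile name are placeholders, the heap values are the ones above):

#!/bin/bash
#SBATCH --job-name=nextflow-driver
#SBATCH --time=5-00:00:00
#SBATCH --mem=4G

# cap the driver JVM heap so it stays inside the job's memory allocation
export NXF_OPTS='-Xms512m -Xmx2G'
nextflow run pipeline.nf -profile slurm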