These are chat archives for nextflow-io/nextflow

13th
Dec 2018
Paolo Di Tommaso
@pditommaso
Dec 13 2018 07:46
@tobsecret "getting loads of concurrentModificationError messages lately" this is not good, please open an issue including the error stack trace you can find the .nextflow.log file
Tobias Neumann
@t-neumann
Dec 13 2018 09:39
@pditommaso any idea what I could look at? If jobs get submitted to AWS Batch with a docker:// prefix in my container directives, they crash there (obviously) with CannotPullContainerError: API error (400): invalid reference format, but when I remove the prefix, the run is stuck at [warm up] executor > awsbatch. Really lost here...
Paolo Di Tommaso
@pditommaso
Dec 13 2018 14:23
container images must be the same as ingested by the docker run/pull command; the docker:// prefix is only required by Singularity
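For illustration, a minimal sketch of the two forms in a process definition (the process name, image, and command are placeholders):

process example {
    // a plain image reference, as expected by the Docker engine and AWS Batch
    container 'ubuntu:18.04'
    // a docker:// URI is Singularity-style and is rejected by Docker as an invalid reference:
    // container 'docker://ubuntu:18.04'

    script:
    """
    cat /etc/os-release
    """
}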
Tobias Neumann
@t-neumann
Dec 13 2018 14:57
That is exactly my point: when I use the wrong container directive with the docker:// prefix, the jobs get submitted to AWS Batch (where they crash, but that is obviously expected). The second I fix the container directive by removing the docker:// prefix, however, the jobs do not get submitted to AWS Batch anymore and the Nextflow process is stuck in the [warm up] executor > awsbatch state. For some weird reason, correcting this error messes up the job submission process
Tobias Neumann
@t-neumann
Dec 13 2018 15:18
is there something like a public test S3 bucket?
because I need to put indices and test data somewhere
Tobias "Tobi" Schraink
@tobsecret
Dec 13 2018 15:26
@pditommaso Okidoki, lemme check if I still have those
Tobias Neumann
@t-neumann
Dec 13 2018 15:37
It basically stops in the middle of creating the working directories and only some of them have the .command* files
2018-12-13 16:28:41          0 work/centrifuge/e4/55c2594a9061bae03a4dff1683f32e/
2018-12-13 16:28:41          0 work/centrifuge/ec/2e9bf7b991efd69a6e96634cb23be9/
2018-12-13 16:28:41          0 work/centrifuge/f0/8d4d94a46731e319d38e5b0aea2f37/
2018-12-13 16:28:41       5111 work/centrifuge/f0/8d4d94a46731e319d38e5b0aea2f37/.command.run
2018-12-13 16:28:41        299 work/centrifuge/f0/8d4d94a46731e319d38e5b0aea2f37/.command.sh
2018-12-13 16:28:41       3419 work/centrifuge/f0/8d4d94a46731e319d38e5b0aea2f37/.command.stub
2018-12-13 16:28:41          0 work/centrifuge/f2/258ef0a5d79512dcc9799b5f3d9141/
HA ok now I got some clues - I have 50 pairs of fastq sequences
KochTobi
@KochTobi
Dec 13 2018 15:41
AWS S3 cannot handle rolling files; I had the same issue with the Sarek pipeline: SciLifeLab/Sarek#680
Tobias Neumann
@t-neumann
Dec 13 2018 15:42
If I only supply one pair - the submission gets through and all processes start nicely:
[warm up] executor > awsbatch
[95/8cc80c] Submitted process > centrifugePaVE (fdbc62bb-592f-4452-904a-02e51088b0d6_gdc_realn_rehead)
Maxime Garcia
@MaxUlysse
Dec 13 2018 15:55
We had troubles but it was with the trace
cf #916
which is a duplicate of #813
Tobias "Tobi" Schraink
@tobsecret
Dec 13 2018 16:25
@stevekm seeing the same problems (i.e. NOTE: Error submitting process '...' for execution -- Execution is retried)
Stephen Kelly
@stevekm
Dec 13 2018 16:51
@tobsecret you mean on big purple today? Yeah my pipeline just died 2hrs ago from that as well.
[f4/92c717] NOTE: Error submitting process 'eval_pair_vcf (SampleID.MuTect2)' for execution -- Execution is retried (1)
ERROR ~ Error executing process > 'eval_pair_vcf (SampleID.MuTect2)'

Caused by:
  Failed to submit process to grid scheduler for execution

Command executed:

  sbatch .command.run

Command exit status:
  1

Command output:
  sbatch: error: Batch job submission failed: Socket timed out on send/recv operation

Work dir:
  /gpfs/data/molecpathlab/production/NGS580/180316_NB501073_0036_AH3VFKBGX5/work/f7/55e93e2adce10c298b177ba56dfea1

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details
WARN: Killing pending tasks (9)
WARN: [SLURM] queue (cpu_medium) status cannot be fetched > exit status: 1
yeah once again, the job actually did get submitted to SLURM and run;


kellys04@bigpurple-ln4:/gpfs/data/molecpathlab/production/NGS580/180316_NB501073_0036_AH3VFKBGX5/work/f7/55e93e2adce10c298b177ba56dfea1$ lt
total 356K
drwx--S--- 14 kellys04 8.0K Dec 13 09:02 ..
-rw-------  1 kellys04  243 Dec 13 09:02 .command.sh
-rw-------  1 kellys04 3.5K Dec 13 09:02 .command.stub
-rw-------  1 kellys04 5.8K Dec 13 09:02 .command.run
-rw-------  1 kellys04    0 Dec 13 09:02 .command.begin
-rw-------  1 kellys04 5.9K Dec 13 09:02 .env.begin
-rw-------  1 kellys04  272 Dec 13 09:15 .command.out
-rw-------  1 kellys04 4.9K Dec 13 09:15 .command.err
-rw-------  1 kellys04  16K Dec 13 09:15 SampleID.MuTect2.eval.grp
-rw-------  1 kellys04  204 Dec 13 09:15 .command.trace
-rw-------  1 kellys04 5.5K Dec 13 09:15 .command.log
drwx--S---  2 kellys04 8.0K Dec 13 09:15 .
-rw-------  1 kellys04    1 Dec 13 09:15 .exitcode

[2018-12-13 11:52:24]
kellys04@bigpurple-ln4:/gpfs/data/molecpathlab/production/NGS580/180316_NB501073_0036_AH3VFKBGX5/work/f7/55e93e2adce10c298b177ba56dfea1$ cat .exitcode
0
[2018-12-13 11:52:54]
kellys04@bigpurple-ln4:/gpfs/data/molecpathlab/production/NGS580/180316_NB501073_0036_AH3VFKBGX5/work/f7/55e93e2adce10c298b177ba56dfea1$ head .command.log
USER:kellys04 SLURM_JOB_ID:545799 SLURM_JOB_NAME:nf-eval_pair_vcf_(SampleID.MuTect2) HOSTNAME:cn-0030 PWD:/gpfs/data/molecpathlab/production/NGS580/180316_NB501073_0036_AH3VFKBGX5/work/f7/55e93e2adce10c298b177ba56dfea1 NTHREADS:none
nxf-scratch-dir cn-0030:/tmp/nxf.4LyRxSmCip
INFO  14:02:57,419 HelpFormatter - ----------------------------------------------------------------------------------
INFO  14:02:57,422 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50
INFO  14:02:57,422 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO  14:02:57,423 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO  14:02:57,423 HelpFormatter - [Thu Dec 13 14:02:57 UTC 2018] Executing on Linux 3.10.0-693.17.1.el7.x86_64 amd64
INFO  14:02:57,423 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_102-8u102-b14.1-1~bpo8+1-b14
INFO  14:02:57,426 HelpFormatter - Program Args: -T VariantEval -R genome.fa -o SampleID.MuTect2.eval.grp --dbsnp dbsnp_138.hg19.vcf --eval SampleID.MuTect2.filtered.vcf
INFO  14:02:57,438 HelpFormatter - Executing as kellys04@cn-0030 on Linux 3.10.0-693.17.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_102-8u102-b14.1-1~bpo8+1-b14.
micans
@micans
Dec 13 2018 17:33
@pditommaso would you be open to a feature request for a method on a file object that returns its number of lines? Something like file.numLines(). I assume file.readLines().size() might work, but that's a roundabout way. This would be useful to me when I construct a 'metafile' with all the inputs (and want to adapt memory dynamically).
Stephen Kelly
@stevekm
Dec 13 2018 17:36
filed a bug report for the SLURM issues here: nextflow-io/nextflow#970
@micans I have a couple different ways to count the number of lines shown here: https://github.com/stevekm/nextflow-demos/blob/master/filter-channel/main.nf
micans
@micans
Dec 13 2018 17:48
Thanks @stevekm -- as far as I can see file.readLines().size() is pretty idiomatic. It might be worth a shortcut.
Stephen Kelly
@stevekm
Dec 13 2018 17:49
yeah, as I had noted there, the downside is that it reads the entire file into memory, so if you're trying to read a giant file you might want the other one instead
import java.nio.file.Files;
long count = Files.lines(sample_tsv).count()
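A rough sketch of how the line count might then drive a dynamic memory request in a Nextflow script; params.metafile, the threshold, and the memory sizes are all made up for illustration:

// count lines up front; readLines() loads the whole file, so for huge files
// prefer the streaming Files.lines(...).count() shown above
metafile = file(params.metafile)
n_lines  = metafile.readLines().size()

process use_metafile {
    // scale the memory request with the size of the metafile
    memory { n_lines > 50000 ? '32 GB' : '8 GB' }

    input:
    file 'meta.txt' from metafile

    script:
    """
    wc -l meta.txt
    """
}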
micans
@micans
Dec 13 2018 17:52
OK cool, good to know!
Ooh, tardigrade, favourite animal. And now it helps me with nextflow, one of life's surprises.
micans
@micans
Dec 13 2018 18:26
Question: I apply sum() to a Channel, and would like to stick it in a groovy variable for logging purposes. Haven't found out how yet.
Oh, .subscribe{ n_numreads = it } seems to do it.
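A minimal sketch of that pattern; the channel contents and variable name here are just placeholders:

// sum a channel and capture the result in a plain Groovy variable via subscribe;
// subscribe runs asynchronously, so only read the variable once the run has finished
n_numreads = 0

Channel
    .from( 100, 250, 75 )            // e.g. per-sample read counts
    .sum()
    .subscribe { n_numreads = it }

workflow.onComplete {
    println "total number of reads: ${n_numreads}"
}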
Tobias "Tobi" Schraink
@tobsecret
Dec 13 2018 18:53
@stevekm yep, the steps are retried though (maxRetries 3)... Maybe that's what is causing the concurrentModification issues down the line? Jobs are submitted but not registered as submitted by Nextflow, then the same data is downloaded twice, and when Nextflow tries to link them to my publishDir, it gets two processes trying to modify the same link?
Stephen Kelly
@stevekm
Dec 13 2018 19:05

that is strange, I have been using maxRetries for a little bit now and have not had that issue. However I just updated my config for big purple to look like this:

params.ref_dir = "/gpfs/scratch/kellys04/molecpathlab/ref"
params.ANNOVAR_DB_DIR = "${params.ref_dir}/annovar/db"

// SLURM executor config
process.executor = 'slurm'
executor.queueSize = 50 // submit up to 50 jobs at a time
executor.pollInterval = '5min' // *** how often a poll occurs to check for a process termination
executor.queueStatInterval = '5min' // *** how often the queue status is fetched from the cluster system
executor.exitReadTimeout = '5min' // *** how long the executor waits before returning an error status when a process is terminated but the exit file does not exist or is empty
executor.killBatchSize = 10 // *** the number of jobs that can be killed in a single command execution
executor.submitRateLimit = '10 sec' // *** the max rate of jobs that can be submitted per time unit, e.g. '10 sec' = max 10 jobs per second
params.queue = "cpu_short" // allow the queue to be set from the CLI
process.queue = "${params.queue}"
process.clusterOptions = '--ntasks-per-node=1 --export=NONE --export=NTHREADS --mem-bind=local' // --time-min=3:00:00 --tmp=16G

// Singularity config
process.module = "singularity/2.5.2"
singularity.enabled = true
singularity.autoMounts = true
singularity.envWhitelist = "NTHREADS"

// job config values
params.cpus_num_big = 16
params.cpus_num_mid = 8
params.cpus_num_small = 4

// global process config
// try to prevent 'module: command not found' errors by sourcing the module config, and pause to allow the environment to finish populating
process.beforeScript = ' . /etc/profile.d/modules.sh; sleep 1; printf "USER:\${USER:-none} SLURM_JOB_ID:\${SLURM_JOB_ID:-none} SLURM_JOB_NAME:\${SLURM_JOB_NAME:-none} HOSTNAME:\${HOSTNAME:-none} PWD:\$PWD NTHREADS:\${NTHREADS:-none}\n"; TIMESTART=\$(date +%s); env > .env.begin'
process.afterScript = 'printf "elapsed time: %s\n" \$((\$(date +%s) - \${TIMESTART:-0})); env > .env.end'
process.errorStrategy = "retry" // re-submit failed processes; try to mitigate SLURM and 'module: command not found' errors, etc.
process.maxRetries = 1 // retry a failed process up to 1 time as per ^^
process.cpus = 2 // 2 CPUs by default because cgroups on Big Purple limit access to the SLURM-allocated cores only
// also allocate 1 more CPU than designated in tasks to allow for process overhead threads
filePorter.maxThreads = 4 // number of Nextflow threads for moving files
process.scratch = true
// process.scratch = "/gpfs/scratch/${username}"

I put a *** next to the items I just added this morning, taking a shotgun approach to try and baby the scheduler. SLURM has definitely been lagging a lot, because even things like sinfo have returned timeout and connection errors, so I am thinking that maybe it's lagging so much that Nextflow thinks it timed out when it's really just being slow. So @tobsecret yeah, perhaps it's possible that Nextflow is trying to retry the jobs it thinks did not get submitted, then ends up submitting a second job, and both run and end up with the same files being worked on by Nextflow? idk, doesn't sound too far-fetched

oh @tobsecret but what publishDir mode are you using? Right now I am just symlinking them into the publishDir during Nextflow execution, then later I resolve all the symlinks myself if the results look OK; maybe that makes the difference. Symlinks take no time at all to create, but large files might take a while to copy, especially since the GPFS has been having IO throttling issues, so you could get the same files being copied to the publishDir at the same time???
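For reference, the difference is only the publishDir mode; a minimal sketch with a made-up process and output path:

process collect_results {
    // mode 'symlink' (the default) links results back into work/ instantly;
    // mode 'copy' duplicates the files, which can be slow on a loaded filesystem
    publishDir "results/collect", mode: 'symlink'   // or: mode: 'copy'

    output:
    file 'out.txt' into results_ch

    script:
    """
    echo done > out.txt
    """
}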
Tobias "Tobi" Schraink
@tobsecret
Dec 13 2018 19:37
I am just using symlink for publishDir. But today I also have not gotten any concurrentModificationException errors yet and I have downloaded 409 files so far. But yes, something in general has been a bit off with our cluster lately - according to the report upon login, I have a negative GB of files in my home directory :sweat_smile:
Will try and add your suggested changes to my nextflow.config if I end up running into problems - thanks @stevekm
Tobias Neumann
@t-neumann
Dec 13 2018 20:23
@pditommaso So I think I know where the bottleneck is: I did not have any AWS Batch job definition for the tasks I submitted, and when I launched all 50 samples the engine somehow got hung up, because no jobs were submitted and no job definition was created. Then I repeated the process with only one sample, which actually got submitted and had a job definition created. With the job definition available, when I later tried all 50 samples it worked without problems. Could this be the reason?
Tobias "Tobi" Schraink
@tobsecret
Dec 13 2018 21:33

@pditommaso hit the same problem with the concurrentModificationException again and filed a bug report:
nextflow-io/nextflow#971

Kinda difficult to make a minimal verifiable and complete example but I tried.

Stephen Kelly
@stevekm
Dec 13 2018 22:58
@tobsecret have you talked with the HPC admins about the downloads? I know the network has been a sticking point at times in the past, so they might have ideas on how to make things more reliable on the cluster, or they might know of restrictions that are affecting it
Tobias "Tobi" Schraink
@tobsecret
Dec 13 2018 23:22
It's not the latency, I think - if latency were the issue, the processes would throw an error instead of failing to submit.
also HOLY ** how are you so fast @pditommaso ?! Giving the Nextflow patch a go rn
Tobias "Tobi" Schraink
@tobsecret
Dec 13 2018 23:35
Should know tomorrow morning how well it did :muscle: