These are chat archives for nextflow-io/nextflow

14th
May 2019
Paolo Di Tommaso
@pditommaso
May 14 14:28
Nextflow Camp speakers! :point_down: :point_down:
Alaa Badredine
@AlaaBadredine_twitter
May 14 14:39
@pditommaso thanks for sharing! How can we apply to join this course? Will it be streamed online?
Paolo Di Tommaso
@pditommaso
May 14 14:41
of course, application for course & camp is here
streamed, likely not
Alaa Badredine
@AlaaBadredine_twitter
May 14 14:42
thanks for the link !
Paolo Di Tommaso
@pditommaso
May 14 14:42
:+1:
Tobias "Tobi" Schraink
@tobsecret
May 14 15:02
How do you flatten a channel that's just one file glob expression? i.e. `file('*.vcf.gz')` -> `Channel.from([file('1.vcf.gz'), file('2.vcf.gz'), file('3.vcf.gz')])`
The rationale here is that I have a process to download a couple of files which spits them all out as a single `file('*.vcf.gz')` output into vcf_ch, which I tried to consume in the downstream process using `file(inputvcf) from vcf_ch.flatten()`
Tobias "Tobi" Schraink
@tobsecret
May 14 15:09
This is also what the Nextflow patterns page suggests:
http://nextflow-io.github.io/patterns/index.html#_process_per_file_output
But then in the downstream process I get the error `gzip: *.vcf.gz: not in gzip format`, so it seems like `vcf_ch.flatten()` does not actually emit single files?
micans
@micans
May 14 15:12
Maybe you need `file('*.vcf.gz') into vcf_ch`?
`gzip` is seeing something literally called `*.vcf.gz`
Or the glob doesn't match?
Tobias "Tobi" Schraink
@tobsecret
May 14 15:15
Yeah, I'm a bozo - I think the glob didn't match. Needed to be `file('**.vcf.gz')`
:facepalm:
micans
@micans
May 14 15:16
ah ... does the double glob include directories?
Tobias "Tobi" Schraink
@tobsecret
May 14 15:24
the files were downloaded using wget and I forgot the flag that flattens the wget directories, so the `*.vcf.gz` glob didn't find them - meanwhile the `**.vcf.gz` glob should find them
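
A minimal sketch of the pattern being discussed, assuming DSL1 syntax and a hypothetical download process - `**` crosses directory boundaries, and `flatten()` then emits the matched files one at a time; the URL and process names are illustrative, not from the thread:

process download {
    output:
    file('**.vcf.gz') into vcf_ch   // '**' also matches files inside subdirectories

    // recursive wget keeps the server's directory tree (hence '**' above)
    """
    wget -r -A '*.vcf.gz' ftp://example.org/vcfs/
    """
}

process check {
    input:
    file(inputvcf) from vcf_ch.flatten()   // one file per task

    """
    gzip -t ${inputvcf}
    """
}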
Alaa Badredine
@AlaaBadredine_twitter
May 14 16:08
I opened a new issue nextflow-io/nextflow#1149
Stephen Kelly
@stevekm
May 14 16:41

@stevekm I am running things from scratch but keep the project dirs on home bc scratch is flushed on Prince (supposedly) and I don't have enough space / number of files allowed on home

@tobsecret it sounds like you might want the `scratch` directive? You can tell it where the scratch location is to do the work in, and then it copies the final results back to the work dir. And you could have your publishDir be another location from where you are executing, if you wanted. On Big Purple, home dirs have a low file-number limit, so you have to do everything on lab data directories

Anybody used glob patterns with ftp servers before?

@tobsecret you would probably want to mount the filesystem first. Not sure exactly what the best way is - there are some options listed on Google. On Mac I have used Mountain Duck https://mountainduck.io/ but I don't know if there is a Linux version

Stephen Kelly
@stevekm
May 14 16:47

What's the best way to add something to the PATH environment variable in a configuration file while using modules?

@Puumanamana I usually shove all those things in the beforeScript section, for example: https://github.com/NYU-Molecular-Pathology/NGS580-nf/blob/a7aa45a12a4f00436cb5bbeab3d69369d6a166b0/nextflow.config#L480

I actually have to manually load the 'module' command, because on our system it hits race conditions where the 'module' command is not available yet when the script tries to load modules. You can similarly put the PATH update commands in here as well; that's typically how I also use it with conda, because I have specific conda installations I want to use which are not accessible globally.
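
A minimal sketch of that beforeScript pattern in nextflow.config - the PATH entry and module name are placeholders, not taken from the linked config:

process {
    // runs before every task script; a convenient place for environment setup
    beforeScript = 'export PATH=/opt/mytools/bin:$PATH; module load samtools/1.9'
}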

micans
@micans
May 14 16:49
@AlaaBadredine_twitter I've added a comment to your issue.
Stephen Kelly
@stevekm
May 14 16:52
@micans did not know that `true` was a thing in bash, I used `grep <pattern> <file> || :` for this, but same thing @AlaaBadredine_twitter
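
For context, a minimal sketch of the idiom in a process: `grep` exits 1 when nothing matches, which Nextflow would treat as a task failure, so the exit code is swallowed with `|| true` (or equivalently `|| :`). The file and channel names are hypothetical:

process filterVariants {
    input:
    file(vcf) from vcfs_ch

    output:
    file('pass.txt') into pass_ch

    // an empty result is fine here, so don't let grep's exit code 1 fail the task
    """
    grep PASS ${vcf} > pass.txt || true
    """
}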
micans
@micans
May 14 16:54
@stevekm in the old days `:` was apparently faster, but these days `true` is usually a shell builtin I think. I was quite late learning about `:`!
Was always a true believer
Stephen Kelly
@stevekm
May 14 16:59
lol
Stephen Kelly
@stevekm
May 14 17:08

@pditommaso I keep getting sporadic errors that look like this:

ERROR ~ Error executing process > 'mutect2 (228)'

Caused by:
  Process `mutect2 (228)` terminated for an unknown reason -- Likely it has been terminated by the external system

The errors occur at random, usually after a pipeline has been running for some time, and they are not reproducible. The pipeline is running inside a SLURM job, and it both submits other SLURM jobs to execute tasks and uses the local executor to run inside the resource allocation of the parent Nextflow job (this specific job was submitted externally to SLURM). When I look up a Nextflow process that is supposed to have been "terminated" by the external system, it has always completed successfully. So for some reason Nextflow thinks these processes are being terminated by the external system when they are not; they are completing successfully.

I have been going back and forth with our HPC admins about this, and they suggest it's likely that something like the system OOM killer is killing processes that it thinks are using too much memory. Does this sound plausible? Could there be some child process that Nextflow runs to monitor these jobs that is getting killed, causing the error reported here? The parent Nextflow process is not getting killed; it survives, detects the error, and issues retries to re-run the task it thinks has failed. When I check the SLURM job that was running the parent Nextflow pipeline, it usually looks like this:

       JobID    JobName  Partition  AllocCPUS     AveCPU     MaxRSS     ReqMem              Submit               Start                 End  Timelimit        NodeList      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- ---------- ------------------- ------------------- ------------------- ---------- --------------- ---------- --------
1950889      NGS580-de+  fn_medium          8                             48Gn 2019-05-13T11:31:00 2019-05-13T11:31:04 2019-05-13T21:12:18 5-00:00:00         fn-0001  COMPLETED      0:0
1950889.bat+      batch                     8   00:00:01     16368K       48Gn 2019-05-13T11:31:04 2019-05-13T11:31:04 2019-05-13T21:12:18                    fn-0001  COMPLETED      0:0
1950889.ext+     extern                     8   00:00:00      2252K       48Gn 2019-05-13T11:31:04 2019-05-13T11:31:04 2019-05-13T21:12:18                    fn-0001  COMPLETED      0:0

You can see here that I requested 48GB memory for the job, but the reported MaxRSS is only ~16MB (?). Not sure if this is accurate. Is there a way to know how much memory Nextflow is using, and how much memory its child processes are using? And any ideas on what could be causing these errors? Previously I thought it could be the filesystem not responding fast enough, but the admins are not reporting any filesystem lag, and I have the process exit-read timeout set to ~30min and still get these errors.
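
One way to see per-task memory is Nextflow's trace report - a minimal config sketch, with an illustrative field list:

trace {
    enabled = true
    file = 'trace.txt'
    // peak_rss / peak_vmem record each task's peak memory usage
    fields = 'task_id,name,status,exit,peak_rss,peak_vmem'
}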

micans
@micans
May 14 17:31
@stevekm I suddenly wonder -- on LSF, CPU usage is monitored. I never considered this, but e.g. with -trace the job might utilise one (or more?) additional CPUs. This would be caught only infrequently, but it would be a reason to kill the job - though it does not match your description of jobs completing successfully.
Stephen Kelly
@stevekm
May 14 17:33
I am finding more strange things the more I investigate
we have cgroups enabled so even if you spawn more threads they are restricted to the number of physical cores allocated in the job, in this case 8
Stephen Kelly
@stevekm
May 14 17:43

I am digging more into this actual failed SLURM job and found this:

SLURM_JOB_ID: 1962787
SLURM_JOB_NAME: nf-mutect2_(228)
SLURM_JOB_NODELIST: cn-0031
SLURM_JOB_PARTITION: cpu_short

$ saj 1962787
       JobID    JobName  Partition  AllocCPUS     AveCPU     MaxRSS     ReqMem              Submit               Start                 End  Timelimit        NodeList      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- ---------- ------------------- ------------------- ------------------- ---------- --------------- ---------- --------
1962787      nf-mutect+  cpu_short          1                             12Gn 2019-05-13T20:36:39 2019-05-13T20:39:24 2019-05-13T21:15:02   06:00:00         cn-0031  COMPLETED      0:0
1962787.bat+      batch                     1   00:34:34  10803544K       12Gn 2019-05-13T20:39:24 2019-05-13T20:39:24 2019-05-13T21:15:02                    cn-0031  COMPLETED      0:0
1962787.ext+     extern                     1   00:00:00      2252K       12Gn 2019-05-13T20:39:24 2019-05-13T20:39:24 2019-05-13T21:15:02                    cn-0031  COMPLETED      0:0

$ ls -ltra
-rw-r--r--  1 kellys04    196 May 13 21:15 .command.trace
-rw-rw----  1 kellys04   1862 May 13 21:15 .command.log
-rw-rw----  1 kellys04   7168 May 13 21:15 .env.end
-rw-rw----  1 kellys04      1 May 13 21:15 .exitcode

$ cat .exitcode
0

The SLURM job 1962787 completed successfully. But there seem to be some discrepancies when you look at .nextflow.log:

lots of messages that look like this:

May-13 21:08:16.045 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SLURM] queue (cpu_long) status >
  (empty)
May-13 21:08:16.045 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `1962787` exit file: /gpfs/data/molecpathlab/development/NGS580-development-runs/demo-data-test2/work/e8/a18315a4a44272eda33d3ac25cddc1/.exitcode - lastModified: null - size: null

Nextflow keeps trying to find job 1962787 in the cpu_long queue, which is the queue it submitted to, but the job did not run there: our SLURM setup automatically re-queues jobs based on the time request, and in this case the requested time was short enough to fit in cpu_short, so SLURM placed it there instead. @pditommaso would this cause Nextflow to think that the job is in an error state?

Later in the log you can see where it finally gave up on this job:

May-13 21:12:15.997 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `1962787` exit file: /gpfs/data/molecpathlab/development/NGS580-development-runs/demo-data-test2/work/e8/a18315a4a44272eda33d3ac25cddc1/.exitcode - lastModified: null - size: null
May-13 21:12:15.997 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SLURM] queue (cpu_long) status > map does not contain jobId: `1962787`
May-13 21:12:15.998 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exit status for process TaskHandler[jobId: 1962787; id: 2538; name: mutect2 (228); status: RUNNING; exit: -; error: -; workDir: /gpfs/data/molecpathlab/development/NGS580-development-runs/demo-data-test2/work/e8/a18315a4a44272eda33d3ac25cddc1 started: 1557794368929; exited: -; ] -- exitStatusReadTimeoutMillis: 1800000; delta: 1800260
Current queue status:
>   (empty)

Content of workDir: /gpfs/data/molecpathlab/development/NGS580-development-runs/demo-data-test2/work/e8/a18315a4a44272eda33d3ac25cddc1
null
May-13 21:12:15.998 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 1962787; id: 2538; name: mutect2 (228); status: COMPLETED; exit: -; error: -; workDir: /gpfs/data/molecpathlab/development/NGS580-development-runs/demo-data-test2/work/e8/a18315a4a44272eda33d3ac25cddc1 started: 1557794368929; exited: -; ]
May-13 21:12:16.112 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'mutect2 (228)'

Caused by:
  Process `mutect2 (228)` terminated for an unknown reason -- Likely it has been terminated by the external system

This is the only time exitStatusReadTimeoutMillis gets mentioned in the log, from what I can tell; I had it set to 1800000ms (30m). Nextflow declared this job 'dead' at 21:12:15, but the .exitcode was successfully produced/updated at 21:15, 3 minutes later... it's not clear to me whether the .exitcode file existed prior to this, or whether Nextflow could not see it, since it lists the workDir contents as null.

Stephen Kelly
@stevekm
May 14 18:08
I am wondering what exactly exitStatusReadTimeoutMillis is used for. It seems that Nextflow spent ~36min waiting on this job; I am not sure which part of this was the 30min timeout I set, or where in the logs Nextflow actually started considering this job to be in an error state. Should I make my exitStatusReadTimeoutMillis the same as the time limit for these jobs? That does not seem right, but maybe it would solve this? Also not clear whether `[SLURM] queue (cpu_long) status > map does not contain jobId: 1962787` caused an error state, because at the end it also says `status: RUNNING;` and then later `status: COMPLETED;` - so what actually triggered the errors?
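
For reference, the same timeout is exposed in the config as `exitReadTimeout` under the executor scope - a minimal sketch, value illustrative:

executor {
    // how long to wait for the .exitcode file after the scheduler
    // stops reporting the job in the queue
    exitReadTimeout = '30 min'
}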
Tobias "Tobi" Schraink
@tobsecret
May 14 18:19
Anybody have an idea how one would split a file with a header into chunks in Nextflow? Thinking maybe it's possible to do using the route and splitText operators.
Stephen Kelly
@stevekm
May 14 18:20
If it has a header you can use the CSV reader (`splitCsv` with `header: true`) to read each row as a dict
however, if you have complicated splitting criteria you might want to use an external script, unless you have strong Groovy-fu
I did chunking via Python here: https://github.com/stevekm/nextflow-demos/tree/master/bed-split ; it uses a simple .bed file so there is no header, but the scripts could easily be adapted to respect a header
Tobias "Tobi" Schraink
@tobsecret
May 14 18:31
rn I am just using grep and split, so it's pretty simple to do in bash - I just figure it's a neat thing to have under one's belt in Groovy.
The file I am trying to split is a vcf file, so there are multiple header lines that each begin with `#`.
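
A minimal sketch of that grep/split approach wrapped in a process - it keeps the multi-line `#` header and re-prepends it to every chunk; the chunk size is illustrative and an uncompressed VCF is assumed:

process splitVcf {
    input:
    file(vcf) from vcf_ch

    output:
    file('chunk_*') into chunks_ch   // consume downstream with chunks_ch.flatten()

    """
    grep '^#' ${vcf} > header.txt               # keep all the '#' header lines
    grep -v '^#' ${vcf} | split -l 1000 - body_ # split the records into chunks
    for f in body_*; do cat header.txt \$f > chunk_\$f; done
    """
}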
Stephen Kelly
@stevekm
May 14 18:35
if you want to see some basic vcf parsing in Nextflow I have some here: https://github.com/stevekm/nextflow-demos/blob/master/filter-channel/main.nf#L44
I only do line counts though
vcf is a pain to deal with, so in my workflows I generally convert it to .tsv as soon as possible with GATK: https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_variantutils_VariantsToTable.php
then you can do a lot of manipulation on it much more easily
Michael L Heuer
@heuermh
May 14 18:38
There is `split-vcf` in dishevelled-bio, available on Bioconda & Biocontainers
https://anaconda.org/bioconda/dsh-bio
I haven't used it from Nextflow though; I'm not exactly sure how to capture all the split file names created as outputs
najitaleb
@najitaleb
May 14 18:43
I was wondering if I could get some clarification on the last 2 posts on this page: https://groups.google.com/forum/#!topic/nextflow/c_9kocrHxeg . I have the same issue, where R scripts in my pipeline write directly to the output directory by using a config file. I am not sure of the best way to integrate this into Nextflow.
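
One common way to handle this, as a hypothetical sketch rather than what the thread settled on: let the R script write into the task's work directory and let publishDir copy the results out. The script name, channels, and paths are illustrative:

process analyse {
    publishDir "${params.outdir}/analysis", mode: 'copy'

    input:
    file(counts) from counts_ch

    output:
    file('results/*') into results_ch

    // the R script writes under ./results instead of a hard-coded output dir
    """
    mkdir -p results
    Rscript analyse.R ${counts} results/
    """
}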
Tobias "Tobi" Schraink
@tobsecret
May 14 19:11
@heuermh @stevekm Thanks for the pointers! I think I'll just stick with my simple bash pipeline - it really isn't all that difficult. I was just thinking about Groovyisms for better reproducibility, bc not every system will have split, zgrep and zcat
Michael L Heuer
@heuermh
May 14 19:12
Guess this does it
#!/usr/bin/env nextflow

params.dir = "${baseDir}/example"

vcfFiles = "${params.dir}/**.vcf"
vcfs = Channel.fromPath(vcfFiles).map { path -> tuple(path.baseName, path) }

process split {
  tag { sample }

  input:
    set sample, file(vcf) from vcfs
  output:
    set sample, file("*.vcf") into splitVcfs

  """
  dsh-bio split-vcf -r 100 -p "${sample}." -s ".vcf" -i $vcf
  """
}

splitVcfs.subscribe {
  println "Split ${it.get(0)} to ${it.get(1).size()} files:"
  println "${it.get(1).toString()}"
}
Stephen Kelly
@stevekm
May 14 19:32

not every system will have split, zgrep and zcat

@tobsecret I worry about that as well. I think the final solution is to start using conda/singularity/docker to bundle in the GNU/bash tools you use; at some point it becomes unavoidable, because even between macOS and Linux there are differences in shell tool commands. However, this is also one of the reasons I write a lot of it as Python scripts when possible: if you can target Python 2/3-compatible syntax, then you can just write a small script that will work on any system with Python, which is pretty much guaranteed to exist on any system you might ever be running Nextflow on

it's a balance between just writing your code to work in the situation you encounter today, or going the extra mile to future-proof it against problems you predict you might run into. At some point it's just easier to implement the simple solution now and fix it later when you actually need it to be different
Tobias "Tobi" Schraink
@tobsecret
May 14 19:45
Yeah, that makes sense! We'll keep it simple for now, maybe revisiting it later
@stevekm do you know how `collect` and `set` interact?
i.e. `set val(someval), file(somefile) from some_ch.collect()` - does that just work?
Stephen Kelly
@stevekm
May 14 19:46
it should
wait, no
`.collect()` gives you a big list
Tobias "Tobi" Schraink
@tobsecret
May 14 19:47
yep, that 's why I'm not sure
Stephen Kelly
@stevekm
May 14 19:47
if you want to make it a set you need to combine it
against the channel that emits the value
Tobias "Tobi" Schraink
@tobsecret
May 14 19:48
What if the creation of some_ch originally looked like this: `set val(someval), file(somefile) into some_ch.collect()`
Stephen Kelly
@stevekm
May 14 19:49
I do it here:
Tobias "Tobi" Schraink
@tobsecret
May 14 19:49
Oh yeah ofc - groupTuple! Thanks a ton!
you start with something that emits `set val(sampleID), file(someFile)`
then `.groupTuple()` to turn all the someFile into a list of files
then you can pass it off to a process like this:
input:
    set val(tumorID), val(normalID), file(tumorNormalFiles: "*"), file(sampleFiles: "*"), file(report_items: '*') from sample_pairs_output_files2.combine(samples_report_files)
Tobias "Tobi" Schraink
@tobsecret
May 14 19:58
Hmmm, isn't groupTuple going to group by some key? I want to group by position, i.e. `[[val1, file1], [val2, file2]] -> [[val1, val2], [file1, file2]]`
Stephen Kelly
@stevekm
May 14 19:59
yes that is why I pre-build the channel with a key
// channel for sample reports; gather items per-sample from other processes
sampleIDs.map { sampleID ->
    // dummy file to pass through channel
    def placeholder = file(".placeholder1")

    return([ sampleID, placeholder ])
}
// add items from other channels (unpaired steps)
.concat(sample_signatures_reformated) // [ sampleID, [ sig_file1, sig_file2, ... ] ]
// group all the items by the sampleID, first element in each
.groupTuple() // [ sampleID, [ [ sig_file1, sig_file2, ... ], .placeholder, ... ] ]
// need to flatten any nested lists
.map { sampleID, fileList ->
    def newFileList = fileList.flatten()

    return([ sampleID, newFileList ])
}
.into { sample_output_files; sample_output_files2 }
Tobias "Tobi" Schraink
@tobsecret
May 14 20:00
oh, ok - that makes sense. That's a little bit hacky but if there is no dedicated method for what I want to do, then I'll just do it this way!
Stephen Kelly
@stevekm
May 14 20:01
there might be a way to do the exact thing you show, but I cannot think of it off the top of my head: `[[val1, file1], [val2, file2]] -> [[val1, val2], [file1, file2]]`
another hack you can do is to just use `.collect()` but then immediately re-map it and rebuild the list in Groovy, then emit a nested list like you show, then use `.flatten()` (I think) to emit each list element (the sub-lists) individually
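
A sketch of that collect-and-rebuild hack, using Groovy's List.transpose() for the position-wise regrouping; the channel contents are illustrative:

Channel.from(['val1', 'file1'], ['val2', 'file2'])
    .toList()                            // [[val1, file1], [val2, file2]]
    .map { pairs -> pairs.transpose() }  // [[val1, val2], [file1, file2]]
    .subscribe { println it }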
Stephen Kelly
@stevekm
May 14 20:07

also I kinda figured out the solution to the errors I was getting @pditommaso
because Big Purple keeps re-assigning your jobs to different queues based on the time requirements, I get a lot of messages in the log like this:

May-13 20:39:29.478 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SLURM] getting queue (cpu_long) status > cmd: squeue --noheader -o %i %t -t all -p cpu_long -u kellys04
May-13 20:39:37.769 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SLURM] queue (cpu_long) status > cmd exit: 0

May-13 20:39:37.769 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SLURM] queue (cpu_long) status >
  (empty)
May-13 20:39:37.769 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SLURM] queue (cpu_long) status > map does not contain jobId: `1962787`

Nextflow thinks my job should be in the cpu_long queue, which is the one it submitted to, but the job actually got moved by SLURM to the cpu_short queue. This, combined with the job still running and not yet having produced .exitcode, seems to have triggered the exitReadTimeout, which was set to 30min

May-13 21:12:15.997 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `1962787` exit file: /gpfs/data/molecpathlab/development/NGS580-development-runs/demo-data-test2/work/e8/a18315a4a44272eda33d3ac25cddc1/.exitcode - lastModified: null - size: null
May-13 21:12:15.997 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SLURM] queue (cpu_long) status > map does not contain jobId: `1962787`
May-13 21:12:15.998 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exit status for process TaskHandler[jobId: 1962787; id: 2538; name: mutect2 (228); status: RUNNING; exit: -; error: -; workDir: /gpfs/data/molecpathlab/development/NGS580-development-runs/demo-data-test2/work/e8/a18315a4a44272eda33d3ac25cddc1 started: 1557794368929; exited: -; ] -- exitStatusReadTimeoutMillis: 1800000; delta: 1800260
Current queue status:
>   (empty)

Content of workDir: /gpfs/data/molecpathlab/development/NGS580-development-runs/demo-data-test2/work/e8/a18315a4a44272eda33d3ac25cddc1
null
May-13 21:12:15.998 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 1962787; id: 2538; name: mutect2 (228); status: COMPLETED; exit: -; error: -; workDir: /gpfs/data/molecpathlab/development/NGS580-development-runs/demo-data-test2/work/e8/a18315a4a44272eda33d3ac25cddc1 started: 1557794368929; exited: -; ]
May-13 21:12:16.112 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'mutect2 (228)'

Caused by:
  Process `mutect2 (228)` terminated for an unknown reason -- Likely it has been terminated by the external system
But it leaves me wondering why Nextflow uses this command to check the queue: `squeue --noheader -o %i %t -t all -p cpu_long -u kellys04`; if Nextflow omitted the `-p cpu_long` portion, it would have found job 1962787 in the cpu_short queue
Tobias "Tobi" Schraink
@tobsecret
May 14 20:12
I think I figured out what I was looking for - it's transpose
micans
@micans
May 14 20:12
@stevekm nice work! And: "At some point it's just easier to implement the simple solution now and fix it later when you actually need it to be different" - word :-)
Stephen Kelly
@stevekm
May 14 20:19
@tobsecret https://www.nextflow.io/docs/latest/operator.html#transpose oh yeah, I was looking at that the other day - that's pretty clever. I could not come up with a situation where I would actually use it, but it fits your use case
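
For reference, a small example of what transpose does - it is essentially the inverse of groupTuple:

Channel.from([1, ['A', 'B', 'C']], [2, ['D', 'E']])
    .transpose()
    .subscribe { println it }
// prints: [1, A], [1, B], [1, C], [2, D], [2, E]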
Tobias "Tobi" Schraink
@tobsecret
May 14 20:40
:sweat_smile: Yep, exactly that one - I remember having used it once before. It's very useful when you need to convert back and forth between groupTuple output and groupTuple input. That time, I used it to split up things that had been grouped together so they could be processed in parallel, and then groupTupled the results back together again.
Paolo Di Tommaso
@pditommaso
May 14 21:05
I can't follow such complicated topics on gitter, please report on GH if it's an issue
Stephen Kelly
@stevekm
May 14 22:12
if I add/remove some env variables in a workflow config, will that break the resume feature?