These are chat archives for nextflow-io/nextflow

1st Mar 2019
Laurence E. Bernstein
@lebernstein
Mar 01 00:06
Are there any known issues with running nextflow from inside a nextflow script? I would like to spin off different workflows based on the input parameters. And can I do this without a problem from a nextflow Docker image?
Luca Cozzuto
@lucacozzuto
Mar 01 09:55
(image: 2uxw7l.jpg)
Chelsea Sawyer
@csawye01
Mar 01 10:49
Hello, is the best way to make a process optional (run only when a certain condition is true) the when directive, or just passing an empty channel that the process would not be able to do anything with and would therefore be skipped?
Luca Cozzuto
@lucacozzuto
Mar 01 10:51
@csawye01 I tend to use when
micans
@micans
Mar 01 11:30
I use when when possible, but sometimes need an empty channel additionally, e.g. when a process starts from a sample file.
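A minimal sketch of the when approach (the process, flag and command names are illustrative):

process optional_step {
    input:
    file x from input_ch

    output:
    file 'out.txt' into result_ch

    when:
    params.run_optional    // hypothetical flag; the task is skipped when false

    script:
    """
    do_something $x > out.txt
    """
}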
Chelsea Sawyer
@csawye01
Mar 01 11:45
Thanks @lucacozzuto and @micans!
Luca Cozzuto
@lucacozzuto
Mar 01 12:03
Now it is my turn :) How do I avoid storing intermediate files in a workflow? Will just setting cache false work?
PhilPalmer
@PhilPalmer
Mar 01 13:57

Hi I wonder if anyone can help. I have the following data in a channel bam_mutect:

[patient1, sample1, 0, H06HDADXX130110.1.ATCACGAT.20k_reads_1, /somatic-variant-caller/work/8d/e3c3b7174813b4de16dcb5d780d6ef/H06HDADXX130110.1.ATCACGAT.20k_reads_1.bam, /somatic-variant-caller/work/8d/e3c3b7174813b4de16dcb5d780d6ef/H06HDADXX130110.1.ATCACGAT.20k_reads_1.bam.bai]
[patient1, sample2, 1, H06HDADXX130110.2.ATCACGAT.20k_reads_1, /somatic-variant-caller/work/3e/819bf632f06c25392a14d953927ff2/H06HDADXX130110.2.ATCACGAT.20k_reads_1.bam, /somatic-variant-caller/work/3e/819bf632f06c25392a14d953927ff2/H06HDADXX130110.2.ATCACGAT.20k_reads_1.bam.bai]

I am trying to use the choice operator to split the data into two channels based on the third element in the array (0/1)

bamsNormal = Channel.create()
bamsTumour = Channel.create()
bam_mutect.choice(bamsTumour, bamsNormal) {it[2] == 0 ? 1 : 0}

However it's not working and both sets of data always end up in the same channel. Any ideas why? Thanks

Stefan Kjartansson
@StefanKjartansson
Mar 01 13:58
Hi, I have a docker-compose setup with nextflow running in a container and I want to use it to execute all processes inside docker containers on the host. I've mounted the docker socket and it is able to start containers on the host. And I've mounted a local directory ("/foobar/work") as "/work" inside the nextflow container. When I execute the process, the command generated by nextflow mounts /work (the directory inside the container) and the execution fails, as this path does not exist on the host that Docker is running on. To solve this, I could create a "/work" directory on the host (and mount that), but that assumes I have the ability to do that on the execution environment, which I'm not sure I will have. Is there a way to supply a prefix which nextflow prepends to the workdir when mounting containers?
Tim Dudgeon
@tdudgeon
Mar 01 14:16
@StefanKjartansson Not that I know of. The easiest approach is to use the same path inside and outside the container. Something like -v $PWD:$PWD -w $PWD (or hard code the paths).
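In config terms, that advice amounts to something like this in the docker scope of nextflow.config (a sketch; the path is illustrative):

docker {
    enabled = true
    // mount the same path inside and outside the container, per Tim's suggestion
    runOptions = '-v /data:/data'
}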
micans
@micans
Mar 01 14:19
@PhilPalmer nothing that I can spot. Have you tried inserting a view() just before you use the choice operator?
Tim Dudgeon
@tdudgeon
Mar 01 14:20
Picking up on the earlier topic of running multiple jobs at the same time, does anyone have any further thoughts on the problem I posted about trying to do this with the ignite executor?
Stefan Kjartansson
@StefanKjartansson
Mar 01 14:23
@tdudgeon I solved it by setting the envvar NXF_WORK to "/foobar/work" and mounting "/foobar/work" to "/foobar/work" in the configuration for the nextflow container, thanks
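For reference, the same effect as NXF_WORK can be had from nextflow.config (a sketch using Stefan's path):

// set the pipeline work directory so the path nextflow mounts into
// task containers also exists on the host
workDir = '/foobar/work'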
Stephen Kelly
@stevekm
Mar 01 14:35

@mchimenti

So I never actually run 'qsub' myself. But doesn't this mean that the main NF job actually runs on the head node?

Based on the configuration you use (which Evan provided an example of), Nextflow creates a bash script for you that wraps up the commands you have in your process, and this script contains all the appropriate SGE directives embedded at the top. It will be named ".command.run" and you will find it in the work directory produced during Nextflow execution. When Nextflow executes the .command.run script it uses the qsub command: qsub .command.run. Check out these scripts in the work directory and it will make more sense. It's submitting all your SGE tasks to the scheduler just like any other job.

@mchimenti that said, Nextflow itself will be running on the head node in this situation. That may or may not be something you care about. My HPC admins freak out when they see all the child process threads spawned by programs like Nextflow running on the head nodes, so I wrap the entire parent Nextflow execution in a job submission as well. There is a (complicated) example here: https://github.com/NYU-Molecular-Pathology/NGS580-nf/blob/15c3504bc58810f956187f59bc526c30ad455404/Makefile#L317
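For anyone following along, the kind of SGE configuration being referred to looks something like this in nextflow.config (a sketch; queue name and resources are illustrative):

process {
    executor = 'sge'                   // each task is submitted via qsub .command.run
    queue = 'all.q'                    // hypothetical queue name
    clusterOptions = '-l mem_free=4G'  // extra scheduler directives
}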
PhilPalmer
@PhilPalmer
Mar 01 14:40
Thanks @micans, I was able to get it to work using view() & by changing it[2] == 0 to it[2] =~ 0. So I think the 0 was a string and not an int
Stephen Kelly
@stevekm
Mar 01 14:42

@lastwon1216

is it possible to run nextflow with multiple processes on same node?
using sge as executor

This is also shown in the link I just posted, except with SLURM instead of SGE; the process would be similar

@danielecook

Does anyone have any teaching slides they would care to share?

I made this slideshow a while back when I was first introducing people to nextflow here at NYU; https://github.com/stevekm/nextflow-demos/blob/docs/docs/Nextflow_presentation.pdf

Stephen Kelly
@stevekm
Mar 01 14:47
@pditommaso are there any updates planned for the version of Nextflow on Anaconda? Looks like it's still on v0.30 https://anaconda.org/bioconda/nextflow
Paolo Di Tommaso
@pditommaso
Mar 01 15:03
we only upload stable releases to conda
micans
@micans
Mar 01 15:04
@PhilPalmer great ... ahhhh that was a thought that actually crossed my mind (string type)! You could use toInteger() rather than depending on matching, but it's not a big deal.
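The corrected selector from that suggestion would look something like this (a sketch based on Phil's snippet):

bam_mutect.choice(bamsTumour, bamsNormal) { it[2].toInteger() == 0 ? 1 : 0 }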
Alexander Peltzer
@apeltzer
Mar 01 15:04
There is a 19.01-0 version there? @stevekm
Michael Chimenti
@mchimenti
Mar 01 15:48
thank you @stevekm :+1:
wow, @stevekm, that's next level... complicated. Being that I've only been at this nextflow stuff for a couple days, I'm gonna have to work up to that
Michael Chimenti
@mchimenti
Mar 01 17:16
Can "NXF_*" environment variables be set in the "nextflow.config" file?
Paolo Di Tommaso
@pditommaso
Mar 01 17:33
no
Michael Chimenti
@mchimenti
Mar 01 19:34
:thumbsup:
Stephen Kelly
@stevekm
Mar 01 19:36
has anybody ever tried using Nextflow.... to run Nextflows?
now I've got like a hundred instances of my pipeline for every sequencing run, many of them are out of date and need to be updated to the latest version of the repo, and then need to be re-run
simply going through and bringing my Nextflow pipeline up to date for each run is becoming a huge chore in itself
it's like I almost need a Nextflow pipeline that can go through all my completed Nextflow pipelines, update each one's git repo, then run it again to completion
Michael Chimenti
@mchimenti
Mar 01 19:38
what is your use case? why 100 separate instances per seq run? I'm curious, I have no answers for you :)
Stephen Kelly
@stevekm
Mar 01 19:39
I am doing the sequencing analysis for a lab here that is producing new runs almost weekly
so they have 100+ runs I have done this way but now some of the older ones are not current with changes I have made in the Nextflow pipeline
so I have to go update 100 Nextflow pipelines and run them again... on the hpc...
each instance is a separate sequencing run
Michael Chimenti
@mchimenti
Mar 01 19:43
naive question, but why can't you just update one pipeline and push all of the old data thru it?
Stephen Kelly
@stevekm
Mar 01 19:44
all the sequencing runs are independent
and it's way too much data for a single instance of the pipeline
Michael Chimenti
@mchimenti
Mar 01 19:45
got it
Stephen Kelly
@stevekm
Mar 01 19:45
1 run takes like 24hrs, so 100 runs would take like 2400 hrs, like 100 days lol
and growing too
Michael Chimenti
@mchimenti
Mar 01 19:47
Are the changes throughout the pipeline, or could you start from alignments? From variants, etc...? That would save time
Stephen Kelly
@stevekm
Mar 01 19:47
yeah I already use the 'resume' feature extensively which helps
Michael Chimenti
@mchimenti
Mar 01 19:48
well, I'm still an NF novice, so I'll let others chime in... hope you find a good solution
Stephen Kelly
@stevekm
Mar 01 19:49
but even 'resume' appears to be bottlenecked by the 'publishDir'; it might take only 30min to run the updated processes but then it sometimes takes another 1-2hr for Nextflow to re-copy the files to the publishDir. @pditommaso does -resume skip copying to publishDir for cached processes? My experience so far says no, because I see all files being updated even if only a few tasks were re-run
Sinisa Ivkovic
@sivkovic
Mar 01 20:26
Hi, I'm testing Nextflow with AWS Batch, and it looks to me that when execution of one job is completed, neither the docker container nor the files (inputs and produced outputs) are deleted. So basically every job just increases disk usage, and eventually the disk becomes full and jobs start failing. Is there any way for Nextflow to delete these files after they are no longer needed, or remove the container after the job completes?
Ghost
@ghost~598345d2d73408ce4f6ff925
Mar 01 20:44
When running a Nextflow pipeline from a configuration file, is there a way to set the working directory of Nextflow? By default it seems Nextflow sets the working directory to wherever Nextflow is being executed from.
evanbiederstedt
@evanbiederstedt
Mar 01 20:50

@sivkovic Let's say for a pipeline you have three processes: alignment (bwa mem), mark duplicates (GATK), and then variant calling (HaplotypeCaller).

it looks to me that when execution of one job is completed neither docker container or files (inputs and produced outputs) are deleted.

You're saying when bwa mem finishes, all of the outputs from that (e.g. the SAM) and the docker container remain. And then the disk space grows, thus requiring a massive AWS instance. Is that right?

Laurence E. Bernstein
@lebernstein
Mar 01 20:54
@stevekm there are various options for publishDir that might help you. Like just linking the files instead of copying? I also thought that my files were not being re-copied when I use resume. I can test though.
Laurence E. Bernstein
@lebernstein
Mar 01 21:07
@stevekm I just retested with the workflow I'm developing and no outputs from the cached processes are copied if I use resume.
Stephen Kelly
@stevekm
Mar 01 21:43
@lebernstein that is interesting, I am going to have to keep a closer eye on this. I used to use symlinks but the problem is that I also have another background process that routinely rsync's the outputs to an NFS drive and that does not work with symlinks
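For reference, the copying behaviour under discussion is controlled by publishDir's mode (a sketch; the path is illustrative):

publishDir 'results/bams', mode: 'copy'    // full copy; this is what gets re-done on resume
// other modes include 'symlink' (the default), 'link' (hard link) and 'move'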
evanbiederstedt
@evanbiederstedt
Mar 01 21:43

@sivkovic So, it sounds like you're using the same EC2 instance for the entire pipeline. And it grows.

I guess the functionality needed is either:

(A) Use the same instance throughout the pipeline. Delete all intermediate files, and scale instance accordingly.
or
(B) Mount to FSx and use the scratch there for i/o read/write. Inputs from S3, final outputs/desirables written to S3
I think this functionality exists somewhere: https://www.nextflow.io/blog/2016/enabling-elastic-computing-nextflow.html

Stephen Kelly
@stevekm
Mar 01 21:46
@anfederico
Not sure how to completely change the working directory for Nextflow, but you can get creative with redirecting all the output, to the point where the parent pipeline dir does not contain much of it; examples:
https://github.com/NYU-Molecular-Pathology/lyz-nf/blob/master/Makefile#L33
https://github.com/NYU-Molecular-Pathology/lyz-nf/blob/master/nextflow.config#L49
I change the work dir, reports dir, log dirs, etc., end result being that the parent Nextflow pipeline dir stays pretty clean and the execution results are all stored in a timestamped subdir. Maybe something like that would be helpful?
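A minimal sketch of that kind of layout in nextflow.config (paths are illustrative):

// keep the pipeline dir clean by redirecting work and report outputs
workDir = 'output/work'

report {
    file = 'output/nextflow-report.html'
}
trace {
    file = 'output/trace.txt'
}
timeline {
    file = 'output/timeline.html'
}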