These are chat archives for nextflow-io/nextflow

5th
Nov 2018
Maxime HEBRARD
@mhebrard
Nov 05 2018 03:47
Question: is there a way to specify, from one channel containing a list of files, that I want to run file1 > process A > process B, and only afterwards file2 > process A > process B?
Currently my flow runs file1 > process A, then file2 > process A, then file1 > process B, then file2 > process B.
The problem here is that I need to store the process A output of all my files before process B runs..... I would prefer running processes A + B and publishing the output of file1 before processing file2 => (saving space)
Tobias Neumann
@t-neumann
Nov 05 2018 08:26
@MaxUlysse I'm just playing around myself, but I suppose you will have to attach those 1000G as EBS to the EC2 instance from which you want to create the AMI. See also https://www.nextflow.io/docs/latest/awscloud.html#custom-ami
@pditommaso Don't know if you ever had one of these, but is there a way to configure Nextflow to switch to a different (more resourceful) AWS Batch queue based upon task.attempt?
Maxime Garcia
@MaxUlysse
Nov 05 2018 08:30
@t-neumann I read that one already, but it's not very detailed on how to proceed exactly
@mhebrard I'm not sure it follows the paradigm Nextflow is based on; from what I understand, when something can be run, it will be run. If you really need to chain your processes A and B into only one, why not make just one process then?
Tobias Neumann
@t-neumann
Nov 05 2018 08:39
@MaxUlysse Did you already start up an EC2 instance with the attached EBS and an Amazon ECS-Optimized Amazon Linux AMI as the base image?
Alexander Peltzer
@apeltzer
Nov 05 2018 08:42
I'm probably too stupid to find it: having users specify --index <path_to_folder>, I'd like to use the content of that folder in another process. I looked at some pipelines doing similar things (nf-core/rnaseq, nf-core/chipseq, nf-core/methylseq) but it doesn't work as expected, unfortunately...
I tried using .collect(), .first()... but I never get all the files from that folder staged into the process directory so the indices can be used for mapping
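(For reference, a pattern that often works for this is to glob the folder and collect() it, so a single task receives every index file at once. A rough sketch; params.index and the process name are assumptions here:)

    Channel
        .fromPath("${params.index}/*")
        .collect()
        .set { ch_index }

    process align {
        input:
        file index_files from ch_index   // all folder files staged into the work dir

        """
        ls -l ${index_files}
        """
    }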
Tintest
@Tintest
Nov 05 2018 08:58

Hello, I got a question,

I'm reading a path from a text file and I would like to know how to "give it" to a fromPath channel.

Channel
    .fromPath('/bettik/tintest/SPARK/illumina/ID.txt')
    .splitText()
    .set { mutect_ID }

    process spark_ID {
        errorStrategy 'finish'
        maxForks params.maxJob
        cpus params.nCpu
        echo true

        input:
        val input_ID from mutect_ID

        output:
        set val("${patientno}"), val("${tumorID}"), val("${healthyID}") into mutect2_ID_ch
        val("${tumorbampath}") into tumor_bam_ch
        val("${healthybampath}") into healthy_bam_ch

        script:
        line = input_ID.toString().trim().split('\t')
        patientno = line[0]
        tumorbampath = line[1]
        tumorbamname = line[2]
        tumorID = line[3]
        healthybampath = line[4]
        healthybamname = line[5]
        healthyID = line[6]

        println "$patientno $tumorID $healthyID"

        """
        """
    }

In this example, tumorbampath and healthybampath are my paths, but if they are declared as file outputs, Nextflow complains that they are "out of the scope of process working dir".

Thank you.
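(One possible approach is splitCsv, which parses each tab-separated line into fields directly in the channel, so the BAMs can travel as file objects and get staged into the work dir. A sketch, assuming ID.txt has the seven tab-separated columns shown later in the thread:)

    Channel
        .fromPath('/bettik/tintest/SPARK/illumina/ID.txt')
        .splitCsv(sep: '\t')
        .map { row -> [ row[0], row[3], row[6], file(row[1]), file(row[4]) ] }
        .set { mutect_ID }

    process spark_ID {
        input:
        set val(patientno), val(tumorID), val(healthyID), file(tumor_bam), file(healthy_bam) from mutect_ID

        """
        echo $patientno $tumorID $healthyID
        """
    }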

Maxime HEBRARD
@mhebrard
Nov 05 2018 09:03
@MaxUlysse yes, I guess that is what I will do... at first I was thinking "one process = one software run", so I wanted to separate my step A and step B... but I guess I can use Nextflow for the "parallel" part and keep the sequential part in one same process: Y > Z > A+B > C
Maxime HEBRARD
@mhebrard
Nov 05 2018 09:17
but that breaks the "module" idea a bit, no?
A = mapping, B = samtools sort + index
then why not build one big process?
Z+A+B+C ...
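(A fused A+B step could look roughly like this sketch; the reference file, tool invocations and channel names are all assumptions:)

    // sketch: mapping (A) plus samtools sort + index (B) in one process,
    // so the intermediate SAM of one file never needs to be stored
    process mapSortIndex {
        input:
        file reads from reads_ch            // assumed FASTQ channel

        output:
        set file("${reads.baseName}.bam"), file("${reads.baseName}.bam.bai") into bam_ch

        """
        bwa mem ref.fa $reads | samtools sort -o ${reads.baseName}.bam -
        samtools index ${reads.baseName}.bam
        """
    }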
Maxime Garcia
@MaxUlysse
Nov 05 2018 09:27
@t-neumann My EC2 instance is based on the Amazon ECS-Optimized Amazon Linux AMI, with an attached EBS volume:
docker info | grep -i data

 Data Space Used: 302.5MB
 Data Space Total: 1.061TB
 Data Space Available: 1.061TB
 Metadata Space Used: 782.3kB
 Metadata Space Total: 1.074GB
 Metadata Space Available: 1.073GB
But when launching jobs, I do get this error on my process:
nxf-scratch-dir ip-172-31-13-96:/tmp/nxf.f6NUr4kgwz
download failed: s3://sarek-dream-test/tumor/T_D0EN0.8_1.fastq.gz to ./T_D0EN0.8_1.fastq.gz [Errno 28] No space left on device
download failed: s3://sarek-dream-test/tumor/T_D0EN0.8_2.fastq.gz to ./T_D0EN0.8_2.fastq.gz [Errno 28] No space left on device
Maxime Garcia
@MaxUlysse
Nov 05 2018 09:36
So I'm guessing I need to do something else in my EC2 instance, but don't really know why, what or how
Tobias Neumann
@t-neumann
Nov 05 2018 09:38
Hm ok that all looks good. How big are the respective fastq.gz files?
Maxime Garcia
@MaxUlysse
Nov 05 2018 09:39
less than 1T, no problem on that
around 9 GB
Tobias Neumann
@t-neumann
Nov 05 2018 09:40
what does this say?
docker info | grep -i base
Maxime Garcia
@MaxUlysse
Nov 05 2018 09:40
  Base Device Size: 10.74GB
Tobias Neumann
@t-neumann
Nov 05 2018 09:41
ah ok, you've got to scale this up as well. Roughly what will be the total file size of a single task?
Maxime Garcia
@MaxUlysse
Nov 05 2018 09:42
OK, how can I scale this up?
No idea what the total size of a single task should be, but that's why I put 1 T
and then restart the docker daemon and rerun the info command to see if it worked
Maxime Garcia
@MaxUlysse
Nov 05 2018 09:45
 Base Device Size: 536.9GB
Seems to be working
I'll try that
Thanks
micans
@micans
Nov 05 2018 11:33
@Tintest -- perhaps your script can make a symlink or hardlink to those files, and then send a file("${tumorbampath}") in the output channel? Then you will have the files in the working directory. That's the only local solution I can see, barring a more global restructuring of your workflow. Perhaps there are other/better solutions.
Tobias Neumann
@t-neumann
Nov 05 2018 11:40
@pditommaso how would I go about switching to a different AWS Batch queue with increasing task.attempt in the config?
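(One possibility, given that most process directives accept a dynamic closure evaluated per task; an untested sketch with made-up queue names:)

    // sketch: retry on a bigger AWS Batch queue after the first attempt
    process heavyJob {
        errorStrategy 'retry'
        maxRetries 2
        queue { task.attempt == 1 ? 'small-queue' : 'big-queue' }

        """
        ./run_heavy_job.sh    # hypothetical command
        """
    }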
Tintest
@Tintest
Nov 05 2018 11:51

@micans, thank you for your answer, I did it, but then : File `/bettik/tintest/SPARK/illumina/S668_D49_C000F49_MD_BSQR2.bam` is out of the scope of process working dir: /bettik/tintest/PROJECTS/Test_nextflow_OAR/work/8e/a0f44eb09db860f1a6734400c64e90

I think I've got a workaround, which is really dirty:

    process spark_ID {
        errorStrategy 'finish'
        maxForks params.maxJob
        cpus params.nCpu
        echo true

        input:
        val input_ID from mutect_ID

        output:
        set val("${patientno}"), val("${tumorID}"), val("${healthyID}") into mutect2_ID_ch
        file("${tumorbamname}") into tumor_bam_ch
        file("${healthybamname}") into healthy_bam_ch

        script:
        // parse one tab-separated line of ID.txt
        line = input_ID.toString().trim().split('\t')
        patientno = line[0]
        tumorbampath = line[1]
        tumorbamname = line[2]
        tumorID = line[3]
        healthybampath = line[4]
        healthybamname = line[5]
        healthyID = line[6]

        println "$patientno $tumorID $healthyID"

        """
        # symlink the existing BAMs into the work dir so they can be
        # declared as file() outputs of this process
        ln -s ${tumorbampath}
        ln -s ${healthybampath}

        touch ${tumorbamname}
        touch ${healthybamname}
        """
    }

I know it's not following the Nextflow paradigm, but in this case I must parse a text file to get some pipeline parameters. I might apply this logic to some more critical steps of my routine pipeline (this example is just side work), and I'm still looking for an elegant solution.

Thank you.

micans
@micans
Nov 05 2018 11:59
Hi @Tintest, this is what I meant: simply ln -s (or ln) in the script section. Now that I think about it again, you could do the splitting in a channel and receive something like set val(patientno), file(...) etc. in the input. Still not clear what you are doing, though. Is ${tumorbamname} a file created by the script section, or does it already exist?
Tintest
@Tintest
Nov 05 2018 12:05

It already exists. ${tumorbamname} is just the bam name, without the absolute path. I could get it by splitting in my process, but while testing different combinations to make it work, I did it this way because it was easier :D

Here is a example of input_ID (my file ID.txt) :

patient1 /bettik/tintest/SPARK/illumina/S668_D4B_C000F4B_MD_BSQR2.bam S668_D4B_C000F4B_MD_BSQR2.bam S668_D4B /bettik/tintest/SPARK/illumina/S668_D4C_C000F4C_MD_BSQR2.bam S668_D4C_C000F4C_MD_BSQR2.bam S668_D4C

Riccardo Giannico
@giannicorik_twitter
Nov 05 2018 12:26

Hi guys,
What if I have a channel created like this: Channel.fromFilePairs("${params.dir}/*.bam", size: 1) { file -> file.name.split(/.bam/)[0] }.set { samplelist }, containing something like this:

[sample1, [/my/data/sample1.bam ]]
[sample2, [/my/data/sample2.bam ]]
[sample3, [/my/data/sample3.bam ]]

and I want to collect all the bam files in the next process, obtaining a variable like this: sample1.bam, sample2.bam, sample3.bam? (sorting not needed)

I tried with this:

input:
file(bamlist) from samplelist.collect()
"""
echo "${bamlist}"
"""

but I get this: input.1 file3.out input.3 file1.out input.5 file2.out instead of this: sample1.bam sample2.bam sample3.bam

Benjamin Wingfield
@nebfield
Nov 05 2018 12:47
Is it possible to specify docker runOptions per-process instead of the global docker scope?
I have a single process that I'd like to run with nvidia-docker
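(If your Nextflow version has the containerOptions process directive, something like this sketch might do it; the image name is made up, and --runtime=nvidia is the flag that nvidia-docker2 registers with Docker:)

    process gpuStep {
        container 'nvidia/cuda:9.0-base'    // hypothetical image
        containerOptions '--runtime=nvidia' // passed straight to docker run

        """
        nvidia-smi
        """
    }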
Thomas Zichner
@zichner
Nov 05 2018 12:51
@giannicorik_twitter Likely the problem is that the channel samplelist contains not only the bam files but also the sample IDs. You can try something like samplelist.map{ it -> it[1][0] }.collect() to first extract the actual file names.
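(Roughly, the consuming process input would then become this sketch:)

    input:
    file(bamlist) from samplelist.map { it -> it[1][0] }.collect()

    """
    echo "${bamlist}"
    """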
micans
@micans
Nov 05 2018 12:57

(back from lunch) @Tintest if it already exists then you can create a tuple in the channel. Something like this could work I think:

Channel
    .fromPath('/bettik/tintest/SPARK/illumina/ID.txt')
    .splitText()
    .map { it.trim() }
    .map { it -> it.split('\t') }
    .map { [ "${it[0]}", "${it[1]}/${it[2]}" ] }
    .set { mutect_ID }

And then accept e.g. val(foo), file(zut) from the channel, obviously reworked to suit your purpose.

Tintest
@Tintest
Nov 05 2018 12:59
I'll try ! Thank you @micans ! :)
micans
@micans
Nov 05 2018 13:00
I hope it does, I'm not sure to be honest!
Particularly not whether the file() will work.
Alexander Peltzer
@apeltzer
Nov 05 2018 13:09
Trying to do this here (two processes that only run depending on certain input parameters): a) is not a process but an if(params.bwa_index) that pushes the existing files to a channel, and b) is a process creating the indices. According to the example, I'd need to use ch_bwa_index.mix(ch_bwa_index_existing) (for example) in the third process, but that leads to this error:
ERROR ~ No such variable: ch_bwa_index_existing

 -- Check script 'main.nf' at line: 582 or see '.nextflow.log' file for more details
Martin Proks
@matq007
Nov 05 2018 13:09
Hey guys, does anyone know how I can make a singularity image writable? The tool I'm using is trying to create a file inside the image.
Alexander Peltzer
@apeltzer
Nov 05 2018 13:09
When I initialize these channels like this: ch_bwa_index_existing = Channel.create(), they're never closed and my pipeline hangs :-(

Hey guys, does anyone know how I can make a singularity image writable? The tool I'm using is trying to create a file inside the image.

Can't you use custom commands, e.g. running with --writable, or is such an option only available in Docker?

micans
@micans
Nov 05 2018 13:14
@apeltzer with if branches and missing channels I always need to create an empty channel in one of the two branches (if or else).
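(In sketch form, with channel and process names assumed, the pattern looks like this:)

    // one branch supplies real files, the other an empty channel;
    // the downstream process mixes the two
    if (params.bwa_index) {
        ch_bwa_index_existing = Channel.fromPath(params.bwa_index)
    } else {
        ch_bwa_index_existing = Channel.empty()
    }

    process makeBWAIndex {
        when:
        !params.bwa_index

        input:
        file fasta from ch_fasta        // assumed reference channel

        output:
        file "*.{amb,ann,bwt,pac,sa}" into ch_bwa_index_built

        """
        bwa index $fasta
        """
    }

    // downstream input, e.g.:
    // file index from ch_bwa_index_built.mix(ch_bwa_index_existing).collect()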
Alexander Peltzer
@apeltzer
Nov 05 2018 13:18
Thanks - just wondered whether it’s me
It seems that there is a difference between if/else clauses and actual processes when resolving the existence of a variable :-)
micans
@micans
Nov 05 2018 13:19
The alternative, I'm sure you know, is to use when: if possible. I've been able to remove some of my if logic that way.
Martin Proks
@matq007
Nov 05 2018 13:21

Yeah, I've tried to add parameter runOptions = "--writable", but then I'm getting

ERROR  : Unable to open squashfs image in read-write mode: Read-only file system
ABORT  : Retval = 255

It's probably because I'm pulling the image manually

Alexander Peltzer
@apeltzer
Nov 05 2018 13:21

Yeah, I've tried to add parameter runOptions = "--writable", but then I'm getting
ERROR : Unable to open squashfs image in read-write mode: Read-only file system ABORT : Retval = 255
It's probably because I'm pulling the image manually

Yes, quite likely the case :-(

Martin Proks
@matq007
Nov 05 2018 13:22
Has anyone had this issue before? Otherwise I have to modify the tool :\
Alexander Peltzer
@apeltzer
Nov 05 2018 13:22

The alternative, I'm sure you know, is to use when: if possible. I've been able to remove some of my if logic that way.

Yeah, same here. But I'm trying to either load data from a user-defined reference folder or create it in a process. The latter uses the when: directive, but the former is hard to set up that way ...

Alexander Peltzer
@apeltzer
Nov 05 2018 13:33
Yeah, now the processes are hanging -.-
micans
@micans
Nov 05 2018 13:34
Do you use Channel.empty()?
I see a create() above
@apeltzer my use case e.g. looks like this:
if (params.studyid > 0) {
    ch_fastqs_dir = Channel.empty()
    ....
}
Alexander Peltzer
@apeltzer
Nov 05 2018 13:36
You're right
micans
@micans
Nov 05 2018 13:38
I was a bit slow in putting 1 and 1 together, as I had seen create() and mentioned empty() :wink:
Alexander Peltzer
@apeltzer
Nov 05 2018 13:45
;-) Thanks for the help - works now
micans
@micans
Nov 05 2018 13:51
:+1:
Riccardo Giannico
@giannicorik_twitter
Nov 05 2018 14:16
@zichner thanks for the suggestion, man! This is the actual working syntax :) samplelist.map{ it[1] }.collect() great! :D
Tintest
@Tintest
Nov 05 2018 14:42

@matq007 I think you have to build the singularity image as writable the first time you build / pull it, like the following: sudo singularity build --writable gatk-4.0.4.0.img docker://broadinstitute/gatk:4.0.4.0

Then runOptions = "--writable" should work
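(For completeness, that option would sit in the singularity scope of nextflow.config; a sketch, assuming the image was built writable as above:)

    singularity {
        enabled = true
        runOptions = '--writable'
    }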

Will Furnass
@WillFurnass_twitter
Nov 05 2018 15:26
Hi. Is there an alternative to curl -fsSL get.nextflow.io|bash for installing NextFlow? Can I just download and unpack a tarball from https://github.com/nextflow-io/nextflow/releases?
Alexander Peltzer
@apeltzer
Nov 05 2018 15:26
Yes
Just download the "*-all" package, chmod +x it, and execute it
(Java is still required; it is not bundled)
Will Furnass
@WillFurnass_twitter
Nov 05 2018 15:29
@apeltzer Thanks! Would be great if the docs listed that as an alternative installation method - sysadmins may be put off installing NextFlow centrally as curl -fsSL get.nextflow.io|bash is less easy to audit.
Martin Proks
@matq007
Nov 05 2018 15:59
@Tintest yeah, I don't think I have sudo permissions. I modified the script; it wasn't complicated in the end :)