These are chat archives for nextflow-io/nextflow

18th
Oct 2017
Simone Baffelli
@baffelli
Oct 18 2017 08:41 UTC
Morning
is it normal that "!{task.workDir}" is null in a shell environment?
Paolo Di Tommaso
@pditommaso
Oct 18 2017 08:41 UTC
morning, today we are a bit late .. :)
Simone Baffelli
@baffelli
Oct 18 2017 08:42 UTC
I started at 7:30 :innocent:
Evan Floden
@evanfloden
Oct 18 2017 08:42 UTC
Who is "we" :sleeping:
Paolo Di Tommaso
@pditommaso
Oct 18 2017 08:42 UTC
LOL
yes, that is expected because shell tasks should not use absolute paths
Simone Baffelli
@baffelli
Oct 18 2017 08:44 UTC
:ok:
too bad
Paolo Di Tommaso
@pditommaso
Oct 18 2017 08:45 UTC
use shell \$PWD if you need the current path
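A minimal sketch of that, with a hypothetical process name: in a shell: block (single-quoted), !{} is reserved for Nextflow variables, while Bash variables such as $PWD keep their normal syntax and resolve to the task work directory at runtime (in a double-quoted script: block the same variable would need escaping as \$PWD):

process whereAmI {
    shell:
    '''
    # $PWD is expanded by Bash inside the task work directory
    echo "running in $PWD with !{task.cpus} cpus"
    '''
}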
Simone Baffelli
@baffelli
Oct 18 2017 08:45 UTC
I want to fill the file slcTab before the command is executed. This is my current attempt
    shell:
        slcTabStr = createStackingTable(slcs as List, slcsPar as List)
        slcTabTextObj = new File("!{task.workDir}/slcTab")
        slcTabTextObj.text = slcTabStr
        '''
        SLC2pt slcTab !{plist} - pSlcPar pSlc
        '''
can I just create a file with slcTabTextObj = new File("slcTab")
or would that not work?
Paolo Di Tommaso
@pditommaso
Oct 18 2017 08:46 UTC
I don't think so
use a map to do that and feed the process with the resulting channel
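A hedged sketch of that suggestion, with hypothetical channel names (pairs_ch, slcTab_ch): build the table file inside a map operator, which runs in the workflow script where writing files is fine, then declare it as a staged file input to the process:

// hypothetical: pairs_ch emits [slcs, slcsPar] tuples
slcTab_ch = pairs_ch.map { slcs, slcsPar ->
    // a temp file on the local filesystem; fine for local execution
    def tab = File.createTempFile('slcTab', '.txt')
    tab.text = createStackingTable(slcs as List, slcsPar as List)
    file(tab.absolutePath)          // a path Nextflow can stage into the task dir
}

process runSlc2pt {
    input:
    file 'slcTab' from slcTab_ch    // appears as ./slcTab in the work dir

    shell:
    '''
    SLC2pt slcTab - pSlcPar pSlc    # plist argument from the original omitted here
    '''
}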
Simone Baffelli
@baffelli
Oct 18 2017 08:47 UTC
I was using the lazy man's approach so far:
echo "!{slcTabStr}" > slcTab
SLC2pt slcTab !{plist} - pSlcPar pSlc
but it is not very elegant
Paolo Di Tommaso
@pditommaso
Oct 18 2017 08:51 UTC
it's fine ! :)
Simone Baffelli
@baffelli
Oct 18 2017 08:53 UTC
I don't like it in the .command.sh file :smile:
especially since the list consists of 250 files
Luca Cozzuto
@lucacozzuto
Oct 18 2017 09:03 UTC
PS: when you use a container... it would be nice to have something in the .command.sh like singularity run CONTAINER.img followed by the command
Paolo Di Tommaso
@pditommaso
Oct 18 2017 09:05 UTC
you can find that in the .command.run
Luca Cozzuto
@lucacozzuto
Oct 18 2017 09:05 UTC
nice!
Simone Baffelli
@baffelli
Oct 18 2017 09:19 UTC
I must annoy you again
Exception evaluating property 'length' for BlankSeparatedList1_groovyProxy
Paolo Di Tommaso
@pditommaso
Oct 18 2017 09:20 UTC
umm use .size() instead of .length
Simone Baffelli
@baffelli
Oct 18 2017 09:20 UTC
Is this not the correct way of getting the length?
nSlc = (slcs as List).length
@pditommaso you deserve a prize for your patience :trophy:
Paolo Di Tommaso
@pditommaso
Oct 18 2017 09:21 UTC
try nSlc = slcs.size()
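For reference, this is plain Groovy behaviour rather than anything Nextflow-specific: staged input collections (such as the BlankSeparatedList in the error above) are Lists, which expose size(); length is a property of arrays only. A minimal illustration:

def slcs = ['a.slc', 'b.slc', 'c.slc']    // a Groovy List, as 'slcs as List' yields
assert slcs.size() == 3                   // collections use size()
assert (slcs as String[]).length == 3     // arrays use .length
// slcs.length                            // fails: no 'length' property on a List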
Simone Baffelli
@baffelli
Oct 18 2017 09:21 UTC
it works
Paolo Di Tommaso
@pditommaso
Oct 18 2017 09:21 UTC
:+1:
nobel prize :sweat_smile:
Simone Baffelli
@baffelli
Oct 18 2017 09:21 UTC
Helpfulness prize
:smile:
Francesco Strozzi
@fstrozzi
Oct 18 2017 09:42 UTC
hello, if I have a process emitting on stdout and I want the output to be present in 2 different channels, is it ok to do something like this?
output:
stdout encode_files_ch_1
stdout encode_files_ch_2
or is it bad ?
Paolo Di Tommaso
@pditommaso
Oct 18 2017 09:44 UTC
it should be ok
Francesco Strozzi
@fstrozzi
Oct 18 2017 09:46 UTC
it seems I am getting a bit of unpredictable behaviour. Sometimes the 2 processes using these channels as input both start, sometimes just one of the 2 starts
Paolo Di Tommaso
@pditommaso
Oct 18 2017 09:49 UTC
umm, are you sure you are not using the same channel two times?
Francesco Strozzi
@fstrozzi
Oct 18 2017 09:50 UTC
sure
now for example I should have 2 processes running, fastqc and quant, and just fastqc is submitted and running
the other is not even submitted by NF
Paolo Di Tommaso
@pditommaso
Oct 18 2017 09:51 UTC
what if you change it to
Francesco Strozzi
@fstrozzi
Oct 18 2017 09:51 UTC
there was also some caching involved, since previous processes had already run. So I don’t know if it is the way I have declared the two output channels or not
Paolo Di Tommaso
@pditommaso
Oct 18 2017 09:51 UTC
output:
stdout into (encode_files_ch_1, encode_files_ch_2)
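In context, a minimal sketch with hypothetical names showing that form: a single stdout output duplicated into two channels, each of which can feed a different downstream process:

process emitCsv {
    output:
    stdout into (rows_a, rows_b)   // one output copied into both channels

    script:
    """
    echo sample1,typeA
    echo sample2,typeB
    """
}
// rows_a and rows_b each receive the full stdout, so two
// downstream processes can consume it independently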
Francesco Strozzi
@fstrozzi
Oct 18 2017 09:52 UTC
I’ll give it a try thanks
Francesco Strozzi
@fstrozzi
Oct 18 2017 13:45 UTC
this is what I see with the output specified like that. I am testing this with just 2 samples and those two channels are feeding two processes called fastqc and quant. I see 2 fastqc tasks, but only 1 quant
[warm up] executor > awsbatch
[52/20865e] Cached process > index (Homo_sapiens.GRCh38.cdna.all.fa.gz)
[6f/5ce471] Submitted process > parseEncode (s3://bucket/encode-test/metadata.small.tsv)
[59/14129f] Submitted process > fastqc (FASTQC on SRR3192620)
[95/e89a6d] Submitted process > fastqc (FASTQC on SRR5210435)
[28/c7e4f6] Submitted process > quant (SRR5210435)
Paolo Di Tommaso
@pditommaso
Oct 18 2017 13:45 UTC
I'd need to see the code
Francesco Strozzi
@fstrozzi
Oct 18 2017 13:47 UTC
That’s the parseEncode process (skipping the Python script emitting on stdout but that’s not the problem)
process parseEncode {

    tag "$params.metadata"

    cpus 2

    memory '4 GB'

    input:
    file(metadata) from Channel.fromPath(params.metadata)

    output:
    stdout into (encode_files_ch_1, encode_files_ch_2)
and then the 2 receiving processes
process quant {

    tag "$dbxref"

    cpus 8

    memory '8 GB'

    input:
    file index from index_ch
    set dbxref,sample_type,strand_specific,url from encode_files_ch_1.splitCsv()

    output:
    file("${sample_type}-${dbxref}") into quant_ch

    script:
    def libType = strand_specific == "True" ? "S" : "U"
    """
    wget ${url}/${dbxref}_1.fastq.gz
    wget ${url}/${dbxref}_2.fastq.gz
    salmon quant --threads $task.cpus --libType=${libType} -i index -1 ${dbxref}_1.fastq.gz -2 ${dbxref}_2.fastq.gz -o ${sample_type}-${dbxref}
    """
}

process fastqc {

    tag "FASTQC on $dbxref"

    cpus 2

    memory '8 GB'

    input:
    set dbxref,sample_type,strand_specific,url from encode_files_ch_2.splitCsv()

    output:
    file("fastqc_${dbxref}_logs") into fastqc_ch


    script:
    """
    wget ${url}/${dbxref}_1.fastq.gz
    wget ${url}/${dbxref}_2.fastq.gz
    mkdir fastqc_${dbxref}_logs
    fastqc -o fastqc_${dbxref}_logs -f fastq -q ${dbxref}_1.fastq.gz ${dbxref}_2.fastq.gz
    """
}
Paolo Di Tommaso
@pditommaso
Oct 18 2017 13:50 UTC
how is index_ch declared ?
Francesco Strozzi
@fstrozzi
Oct 18 2017 13:50 UTC
process index {

    tag "$transcriptome"

    cpus 4

    memory '30 GB'

    input:
    file(transcriptome) from Channel.fromPath(params.transcriptome)

    output:
    file 'index' into index_ch

    script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """
}
this comes from the RNASeq-NF example
Paolo Di Tommaso
@pditommaso
Oct 18 2017 13:51 UTC
not exactly ..
it's
    input:
    file transcriptome from transcriptome_file
then look at transcriptome_file
it's
transcriptome_file = file(params.transcriptome)
not
transcriptome_file = Channel.fromPath(params.transcriptome)
so what's the difference ?
Francesco Strozzi
@fstrozzi
Oct 18 2017 13:58 UTC
mmm this is still quite confusing
but we are talking about the input here; it seems it's the output channel causing the problems
Paolo Di Tommaso
@pditommaso
Oct 18 2017 13:59 UTC
yes, you hit one of the trickiest parts of NF, but I'm going to explain the logic
so when the index process is declared in this way
process index {

    tag "$transcriptome"

    cpus 4

    memory '30 GB'

    input:
    file(transcriptome) from Channel.fromPath(params.transcriptome)

    output:
    file 'index' into index_ch

    script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """
}
you are saying the input can receive many transcriptomes
Channel.fromPath creates a channel queue that can emit many files, depending on the glob specified
that being so, index_ch is an output channel queue that can produce many outputs, though in practice it will produce just one
right so far ?
Francesco Strozzi
@fstrozzi
Oct 18 2017 14:02 UTC
yep
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:03 UTC
fantastic, now what happens is that the downstream process has two inputs that are two channel *queues*
the first emitting *one* item and the other more than one
Francesco Strozzi
@fstrozzi
Oct 18 2017 14:04 UTC
yes
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:04 UTC
however the semantics of a process are to stop the computation as soon as there's a channel with no more content
the process gets an input from each channel queue and launches the execution of a task
since there's no more input for index_ch, it stops
makes sense ?
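A minimal sketch of that rule with made-up values: a process reading from two queue channels launches one task per pairing and stops as soon as either queue runs dry:

ch_one  = Channel.from(1)               // queue emitting a single item
ch_many = Channel.from('x', 'y', 'z')   // queue emitting three items

process pairUp {
    input:
    val a from ch_one
    val b from ch_many

    script:
    """
    echo $a $b
    """
}
// only one task runs (1, x); ch_one is then exhausted,
// so 'y' and 'z' are never consumed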
Francesco Strozzi
@fstrozzi
Oct 18 2017 14:07 UTC
mmm from the last discussion about this and the precedence of channels, I thought that the channel with fewer elements was always consumed last
but then how does declaring file transcriptome from transcriptome_file also affect the output channel ?
this is unclear
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:08 UTC
exactly
because in this case the from part is not a channel but a (file) value
doing that, NF knows that the process will be executed just *once* in any case
Francesco Strozzi
@fstrozzi
Oct 18 2017 14:10 UTC
and so the output channel basically isn't a channel
in this case
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:10 UTC
because there are no inputs with a queue bringing multiple elements
the output is, what we call, a *singleton* channel aka dataflow variable
Francesco Strozzi
@fstrozzi
Oct 18 2017 14:12 UTC
ok makes sense
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:12 UTC
this kind of channel/variable can be read many times and always returns the same value
now
Venkat Malladi
@vsmalladi
Oct 18 2017 14:12 UTC
i am trying to collect the outputs of a channel into a file
using collectFile
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:12 UTC
when you have a process in which there's a combination of channel queues and singleton channels
Venkat Malladi
@vsmalladi
Oct 18 2017 14:13 UTC
so far i have set sampleId, file('.bam'), file('.bai'), biosample, factor, treatment, replicate, controlId into dedupReads
And then output
dedupDesign = dedupReads
.collectFile(name:'design_dedup.tsv', seed:"sample_id\tbam_reads\tbam_index\tbiosample\tfactor\ttreatment\treplicate\tcontrolId\n", storeDir:"$baseDir/output/design")
however collectFile makes a new file for each row
do I have to flatten it all?
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:14 UTC
@fstrozzi the singleton value is applied for each input provided by the input queues
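Putting the explanation together, a sketch of the fix as in the RNASeq-NF example: declaring the transcriptome with file() makes index run once, and index_ch becomes a singleton channel that quant can read once per sample:

transcriptome_file = file(params.transcriptome)   // a value, not a queue channel

process index {
    input:
    file transcriptome from transcriptome_file    // executed just once

    output:
    file 'index' into index_ch                    // index_ch is now a singleton channel

    script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """
}

// quant keeps its queue input from encode_files_ch_1.splitCsv();
// the singleton index_ch is re-read for every sample, always
// yielding the same index path, so one quant task runs per sample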
Francesco Strozzi
@fstrozzi
Oct 18 2017 14:14 UTC
ok
thanks for the explanation
now makes sense
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:15 UTC
nice, this is a bit tricky but once you get it, it works beautifully
Francesco Strozzi
@fstrozzi
Oct 18 2017 14:16 UTC
:+1:
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:17 UTC
@vsmalladi how is the structure of dedupReads ?
Venkat Malladi
@vsmalladi
Oct 18 2017 14:17 UTC
@pditommaso
set sampleId, file('.bam'), file('.bai'), biosample, factor, treatment, replicate, controlId into dedupReads
so I expect an array
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:18 UTC
well, collectFile is designed to concatenate outputs to a file
Venkat Malladi
@vsmalladi
Oct 18 2017 14:18 UTC
which is what i expected
so right now I have the deduplication process running on 4 files
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:19 UTC
you want to merge the bam files ?
Venkat Malladi
@vsmalladi
Oct 18 2017 14:19 UTC
no i want a csv file that has the metadata in it
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:19 UTC
I see
Venkat Malladi
@vsmalladi
Oct 18 2017 14:19 UTC
maybe i am approaching it wrong
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:19 UTC
that's not what collectFile does
Venkat Malladi
@vsmalladi
Oct 18 2017 14:19 UTC
ah okay
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:20 UTC
but you can use it to do that
collectFile gets files and merges them into one or more files
Venkat Malladi
@vsmalladi
Oct 18 2017 14:20 UTC
okay that makes more sense
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:21 UTC
to create a metadata csv you need to have it return a string line for each row of your csv, for example
dedupReads.collectFile {  sample, bam, bai -> "$sample,$bam,$bai" }.set { csv_file_ch }
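Expanded to the eight-field tuple declared above, a hedged sketch of that suggestion: the closure turns each tuple into one TSV line, while name/seed/storeDir keep the original intent (newLine: true appends the line terminator to each row):

dedupReads
    .collectFile(name: 'design_dedup.tsv',
                 seed: "sample_id\tbam_reads\tbam_index\tbiosample\tfactor\ttreatment\treplicate\tcontrolId\n",
                 newLine: true,
                 storeDir: "$baseDir/output/design") { sampleId, bam, bai, biosample, factor, treatment, replicate, controlId ->
        // one tab-separated line per emitted tuple
        [sampleId, bam, bai, biosample, factor, treatment, replicate, controlId].join('\t')
    }
    .set { dedupDesign }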
Venkat Malladi
@vsmalladi
Oct 18 2017 14:23 UTC
ah okay
then it will output into a particular file
perfect
Thanks @pditommaso
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:25 UTC
welcome
hope it works .. :)
Venkat Malladi
@vsmalladi
Oct 18 2017 14:26 UTC
thanks
cedrixic
@cedrixic
Oct 18 2017 15:53 UTC
Hi, quick (and maybe stupid) question: I want to write a process using python. This process should output a bunch of files (in the nextflow context). How do I specify within the python script that the output of this script is a file? (in summary: what is the best option to replace a bash printf in python? a simple file open(), write and close?)
and maybe my question is not clear at all :)
Paolo Di Tommaso
@pditommaso
Oct 18 2017 15:56 UTC
exactly.. :)
cedrixic
@cedrixic
Oct 18 2017 15:57 UTC
ok :) so let's take the sample code bit from the nextflow website

process pyStuff {
    """
    #!/usr/bin/python
    x = 'Hello'
    y = 'world!'
    print "%s - %s" % (x,y)
    """
}
Paolo Di Tommaso
@pditommaso
Oct 18 2017 15:58 UTC
almost there
wrap the code in triple ` then new-line
cedrixic
@cedrixic
Oct 18 2017 15:59 UTC
yes
Paolo Di Tommaso
@pditommaso
Oct 18 2017 15:59 UTC
however that's just a basic example
cedrixic
@cedrixic
Oct 18 2017 15:59 UTC
and now, let's say I don't want to print stuff, but to get output files from this process
it is :) but it's just to illustrate
Paolo Di Tommaso
@pditommaso
Oct 18 2017 16:00 UTC
your python code needs to save those files in the work dir
cedrixic
@cedrixic
Oct 18 2017 16:00 UTC
ok
Paolo Di Tommaso
@pditommaso
Oct 18 2017 16:01 UTC
that helps ?
cedrixic
@cedrixic
Oct 18 2017 16:02 UTC
let me try this, I'll let you know :)
thanks
Paolo Di Tommaso
@pditommaso
Oct 18 2017 16:02 UTC
:+1:
Evan Floden
@evanfloden
Oct 18 2017 16:03 UTC

I guess you want something like this in the script:

with open('myoutput.txt', 'w') as f:
    f.write("%s - %s\n" % (x, y))

and above this, in the output part of the process:

file('myoutput.txt') into pyOutCh

cedrixic
@cedrixic
Oct 18 2017 16:04 UTC
indeed
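Assembled into a single hedged sketch (channel name as in Evan's message): the shebang makes the task run under Python, the script writes its output file into the task work dir, and the output block declares it, binding it to the pyOutCh channel. Note the \\n: backslash escapes need doubling inside a Groovy triple-quoted script block:

process pyStuff {
    output:
    file 'myoutput.txt' into pyOutCh

    script:
    """
    #!/usr/bin/python
    x = 'Hello'
    y = 'world!'
    # written in the task work dir, where Nextflow looks for declared outputs
    with open('myoutput.txt', 'w') as f:
        f.write("%s - %s\\n" % (x, y))
    """
}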
cedrixic
@cedrixic
Oct 18 2017 16:18 UTC
awesome, everything works
Evan Floden
@evanfloden
Oct 18 2017 16:18 UTC
:+1:
cedrixic
@cedrixic
Oct 18 2017 16:18 UTC
@skptic and @pditommaso thanks for the help !
Venkat Malladi
@vsmalladi
Oct 18 2017 16:32 UTC
@cedrixic I have written all of my processes to call python, hope to share a link soon
cedrixic
@cedrixic
Oct 18 2017 16:32 UTC
@vsmalladi sounds good!
Shawn Rynearson
@srynobio
Oct 18 2017 18:46 UTC
Quick nextflow question: is there any level of modularization which at run time would allow you to choose which process (from a list) to include?
similar to scripting method calls.