These are chat archives for nextflow-io/nextflow

18th
Oct 2017
Simone Baffelli
@baffelli
Oct 18 2017 08:41
Morning
is it normal that "!{task.workDir}" is null in a shell environment?
Paolo Di Tommaso
@pditommaso
Oct 18 2017 08:41
morning, today we are a bit late .. :)
Simone Baffelli
@baffelli
Oct 18 2017 08:42
I started at 7:30 :innocent:
Evan Floden
@evanfloden
Oct 18 2017 08:42
Who is "we" :sleeping:
Paolo Di Tommaso
@pditommaso
Oct 18 2017 08:42
LOL
yes, that is expected because shell tasks should not use absolute paths
Simone Baffelli
@baffelli
Oct 18 2017 08:44
:ok:
too bad
Paolo Di Tommaso
@pditommaso
Oct 18 2017 08:45
use shell \$PWD if you need the current path
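A minimal sketch of that advice (the process name here is invented): in a shell block, Nextflow variables use the !{} form, so bash variables such as $PWD can be written plainly and resolve to the task work dir at run time.

process whereAmI {
    shell:
    '''
    # $PWD is a bash variable, so a shell block leaves it untouched
    echo "this task runs in $PWD"
    '''
}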
Simone Baffelli
@baffelli
Oct 18 2017 08:45
I want to fill the file slcTab before the command is executed. This is my current attempt
    shell:
        slcTabStr = createStackingTable(slcs as List, slcsPar as List)
        slcTabTextObj = new File("!{task.workDir}/slcTab")
        slcTabTextObj.text = slcTabStr
        '''
        SLC2pt slcTab !{plist} - pSlcPar pSlc
        '''
can I just create a file with slcTabTextObj = new File("slcTab")
or that would not work?
Paolo Di Tommaso
@pditommaso
Oct 18 2017 08:46
I don't think so
use a map to do that and feed the process with the resulting channel
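One possible reading of that suggestion, as a sketch only (slc_pairs_ch, plist_ch and the slc2pt process are invented around Simone's snippet): build the table text in a map operator, write it to a file there, and let the process stage that file as a normal input.

slc_tab_ch = slc_pairs_ch.map { slcs, slcsPar ->
    // write the stacking table before the task runs
    def tab = File.createTempFile('slcTab', '.txt')
    tab.text = createStackingTable(slcs as List, slcsPar as List)
    tab
}

process slc2pt {
    input:
    file slcTab from slc_tab_ch
    val plist from plist_ch    // assumed: carried over from the snippet above

    shell:
    '''
    SLC2pt !{slcTab} !{plist} - pSlcPar pSlc
    '''
}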
Simone Baffelli
@baffelli
Oct 18 2017 08:47
I was using the lazy man's approach so far
echo "!{slcTabStr}" > slcTab
SLC2pt slcTab !{plist} - pSlcPar pSlc
but it is not very elegant
Paolo Di Tommaso
@pditommaso
Oct 18 2017 08:51
it's fine ! :)
Simone Baffelli
@baffelli
Oct 18 2017 08:53
I don't like it in the .command.sh file :smile:
especially since the list consists of 250 files
Luca Cozzuto
@lucacozzuto
Oct 18 2017 09:03
PS: when you use a container... it would be nice to have something in the .command.sh like singularity run CONTAINER.img and the command
Paolo Di Tommaso
@pditommaso
Oct 18 2017 09:05
you can find that in the .command.run
Luca Cozzuto
@lucacozzuto
Oct 18 2017 09:05
nice!
Simone Baffelli
@baffelli
Oct 18 2017 09:19
I must annoy you again
Exception evaluating property 'length' for BlankSeparatedList1_groovyProxy
Paolo Di Tommaso
@pditommaso
Oct 18 2017 09:20
umm use .size() instead of .length
Simone Baffelli
@baffelli
Oct 18 2017 09:20
Is this not the correct way of getting the length?
nSlc = (slcs as List).length
@pditommaso you deserve a prize for your patience :trophy:
Paolo Di Tommaso
@pditommaso
Oct 18 2017 09:21
try nSlc = slcs.size()
Simone Baffelli
@baffelli
Oct 18 2017 09:21
it works
Paolo Di Tommaso
@pditommaso
Oct 18 2017 09:21
:+1:
nobel prize :sweat_smile:
Simone Baffelli
@baffelli
Oct 18 2017 09:21
Helpfulness prize
:smile:
Francesco Strozzi
@fstrozzi
Oct 18 2017 09:42
hello, if I have a process emitting on the stdout and I want the output to be present in 2 different channels, is it ok to do something like that ?
output:
stdout encode_files_ch_1
stdout encode_files_ch_2
or is it bad ?
Paolo Di Tommaso
@pditommaso
Oct 18 2017 09:44
it should be ok
Francesco Strozzi
@fstrozzi
Oct 18 2017 09:46
it seems I am getting a bit of an unpredictable behaviour. Sometimes I have the 2 processes using these channels as input that start together, sometimes I have just one of the 2 starting
Paolo Di Tommaso
@pditommaso
Oct 18 2017 09:49
umm, are you sure you are using the same channel two times ?
Francesco Strozzi
@fstrozzi
Oct 18 2017 09:50
sure
now for example I should have 2 processes running, fastqc and quant and just fastqc is submitted and running
the other is not even submitted by NF
Paolo Di Tommaso
@pditommaso
Oct 18 2017 09:51
what if you change it to
Francesco Strozzi
@fstrozzi
Oct 18 2017 09:51
there was also some caching involved, since previous processes were already run. So I don’t know if it is the way I have declared the two output channels or not
Paolo Di Tommaso
@pditommaso
Oct 18 2017 09:51
output:
stdout into (encode_files_ch_1, encode_files_ch_2)
Francesco Strozzi
@fstrozzi
Oct 18 2017 09:52
I’ll give it a try thanks
Francesco Strozzi
@fstrozzi
Oct 18 2017 13:45
this is what I see with the output specified like that. I am testing this with just 2 samples and those two channels are feeding two processes called fastqc and quant. I see 2 fastqc tasks, but only 1 quant
[warm up] executor > awsbatch
[52/20865e] Cached process > index (Homo_sapiens.GRCh38.cdna.all.fa.gz)
[6f/5ce471] Submitted process > parseEncode (s3://bucket/encode-test/metadata.small.tsv)
[59/14129f] Submitted process > fastqc (FASTQC on SRR3192620)
[95/e89a6d] Submitted process > fastqc (FASTQC on SRR5210435)
[28/c7e4f6] Submitted process > quant (SRR5210435)
Paolo Di Tommaso
@pditommaso
Oct 18 2017 13:45
I'd need to see the code
Francesco Strozzi
@fstrozzi
Oct 18 2017 13:47
That’s the parseEncode process (skipping the Python script emitting on stdout but that’s not the problem)
process parseEncode {

    tag "$params.metadata"

    cpus 2

    memory '4 GB'

    input:
    file(metadata) from Channel.fromPath(params.metadata)

    output:
    stdout into (encode_files_ch_1, encode_files_ch_2)
and then the 2 receiving processes
process quant {

    tag "$dbxref"

    cpus 8

    memory '8 GB'

    input:
    file index from index_ch
    set dbxref,sample_type,strand_specific,url from encode_files_ch_1.splitCsv()

    output:
    file("${sample_type}-${dbxref}") into quant_ch

    script:
    def libType = strand_specific == "True" ? "S" : "U"
    """
    wget ${url}/${dbxref}_1.fastq.gz
    wget ${url}/${dbxref}_2.fastq.gz
    salmon quant --threads $task.cpus --libType=${libType} -i index -1 ${dbxref}_1.fastq.gz -2 ${dbxref}_2.fastq.gz -o ${sample_type}-${dbxref}
    """
}

process fastqc {

    tag "FASTQC on $dbxref"

    cpus 2

    memory '8 GB'

    input:
    set dbxref,sample_type,strand_specific,url from encode_files_ch_2.splitCsv()

    output:
    file("fastqc_${dbxref}_logs") into fastqc_ch


    script:
    """
    wget ${url}/${dbxref}_1.fastq.gz
    wget ${url}/${dbxref}_2.fastq.gz
    mkdir fastqc_${dbxref}_logs
    fastqc -o fastqc_${dbxref}_logs -f fastq -q ${dbxref}_1.fastq.gz ${dbxref}_2.fastq.gz
    """
}
Paolo Di Tommaso
@pditommaso
Oct 18 2017 13:50
how is index_ch declared ?
Francesco Strozzi
@fstrozzi
Oct 18 2017 13:50
process index {

    tag "$transcriptome"

    cpus 4

    memory '30 GB'

    input:
    file(transcriptome) from Channel.fromPath(params.transcriptome)

    output:
    file 'index' into index_ch

    script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """
}
this comes from the RNASeq-NF example
Paolo Di Tommaso
@pditommaso
Oct 18 2017 13:51
not exactly ..
it's
    input:
    file transcriptome from transcriptome_file
then look at transcriptome_file
it's
transcriptome_file = file(params.transcriptome)
not
transcriptome_file = Channel.fromPath(params.transcriptome)
so what's the difference ?
Francesco Strozzi
@fstrozzi
Oct 18 2017 13:58
mmm this is still quite confusing
but we are talking about the input here, it seems it's the output channel causing the problems
Paolo Di Tommaso
@pditommaso
Oct 18 2017 13:59
yes, you hit one of the trickiest parts of NF, but I'm going to explain the logic
so when the index process is declared in this way
process index {

    tag "$transcriptome"

    cpus 4

    memory '30 GB'

    input:
    file(transcriptome) from Channel.fromPath(params.transcriptome)

    output:
    file 'index' into index_ch

    script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """
}
you are saying the input can receive many transcriptomes
Channel.fromPath creates a channel queue that can emit many files depending on the glob specified
being so, index_ch is an output channel queue that can produce many outputs, though in practice it will produce just one
right so far ?
Francesco Strozzi
@fstrozzi
Oct 18 2017 14:02
yep
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:03
fantastic, now what happens is that the downstream process has two inputs that are two channel *queues*
the first emitting *one* item and the other more than one
Francesco Strozzi
@fstrozzi
Oct 18 2017 14:04
yes
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:04
however the semantics of the process are to stop the computation as soon as there's a channel with no more content
the process gets an input from each channel queue and launches the execution of a task
since there's no more input for index_ch it stops
makes sense ?
Francesco Strozzi
@fstrozzi
Oct 18 2017 14:07
mmm from the last discussion about this and the precedence of channels, I thought that the channel with fewer elements was always consumed last
but how does declaring file transcriptome from transcriptome_file also affect the output channel ?
this is unclear
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:08
exactly
because in this case the from part is not a channel but a (file) value
doing that, NF knows that the process will be executed just *once* in any case
Francesco Strozzi
@fstrozzi
Oct 18 2017 14:10
and so the output channel basically is not a channel
in this case
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:10
because there are no inputs with a queue bringing multiple elements
the output is, what we call, a *singleton* channel aka dataflow variable
Francesco Strozzi
@fstrozzi
Oct 18 2017 14:12
ok makes sense
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:12
this kind of channel/variable can be read many times and it always returns the same value
now
Venkat Malladi
@vsmalladi
Oct 18 2017 14:12
I am trying to collect the outputs of a channel into a file
using collectFile
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:12
when you have a process in which there's a combination of channel queues and singleton channels
Venkat Malladi
@vsmalladi
Oct 18 2017 14:13
so far I have set sampleId, file('.bam'), file('.bai'), biosample, factor, treatment, replicate, controlId into dedupReads
And then output
dedupDesign = dedupReads
.collectFile(name:'design_dedup.tsv', seed:"sample_id\tbam_reads\tbam_index\tbiosample\tfactor\ttreatment\treplicate\tcontrolId\n", storeDir:"$baseDir/output/design")
however the collectFile makes a new file for each row
do I have to flatten it all
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:14
@fstrozzi the singleton value is applied for each input provided by the input queues
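Condensed into code, the distinction reads like this (the useIndex process and reads_ch are invented; the index part mirrors the RNASeq-NF example): a file() value input makes index run exactly once and turns index_ch into a singleton channel that every downstream task can re-read, while the queue input drives one task per emitted item.

transcriptome_file = file(params.transcriptome)   // a value, not a channel
reads_ch = Channel.fromPath(params.reads)         // a queue: one item per matching file

process index {
    input:
    file transcriptome from transcriptome_file    // value input: executed just once

    output:
    file 'index' into index_ch                    // hence a singleton channel

    script:
    """
    salmon index -t $transcriptome -i index
    """
}

process useIndex {
    input:
    file index from index_ch                      // singleton: re-read for every task
    file reads from reads_ch                      // queue: one task per item

    script:
    """
    echo quantifying $reads against $index
    """
}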
Francesco Strozzi
@fstrozzi
Oct 18 2017 14:14
ok
thanks for the explanation
now makes sense
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:15
nice, this is a bit tricky but once you get it, it works beautifully
Francesco Strozzi
@fstrozzi
Oct 18 2017 14:16
:+1:
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:17
@vsmalladi what is the structure of dedupReads ?
Venkat Malladi
@vsmalladi
Oct 18 2017 14:17
@pditommaso
set sampleId, file('.bam'), file('.bai'), biosample, factor, treatment, replicate, controlId into dedupReads
so I expect an array
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:18
well, collectFile is designed to concatenate outputs to a file
Venkat Malladi
@vsmalladi
Oct 18 2017 14:18
which is what I expected
so right now I have the deduplication process running on 4 files
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:19
you want to merge the bam files ?
Venkat Malladi
@vsmalladi
Oct 18 2017 14:19
no, I want a csv file that has the metadata in it
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:19
I see
Venkat Malladi
@vsmalladi
Oct 18 2017 14:19
maybe I am approaching it wrong
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:19
that's not what collectFile does
Venkat Malladi
@vsmalladi
Oct 18 2017 14:19
ah okay
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:20
but you can use it to do that
collectFile gets files and merges them into one or more files
Venkat Malladi
@vsmalladi
Oct 18 2017 14:20
okay that makes more sense
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:21
to create a metadata csv you need to have it return a string line for each row of your csv, for example
dedupReads.collectFile {  sample, bam, bai -> "$sample,$bam,$bai" }.set { csv_file_ch }
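Applied to the eight-field tuples above, that could look roughly like this (a sketch only: the closure must destructure every field, and each returned line needs its own trailing newline or the rows run together):

dedupReads
    .collectFile(
        name: 'design_dedup.tsv',
        seed: "sample_id\tbam_reads\tbam_index\tbiosample\tfactor\ttreatment\treplicate\tcontrolId\n",
        storeDir: "$baseDir/output/design" ) { sampleId, bam, bai, biosample, factor, treatment, replicate, controlId ->
            // one tab-separated row per tuple emitted by the channel
            "$sampleId\t$bam\t$bai\t$biosample\t$factor\t$treatment\t$replicate\t$controlId\n"
    }
    .set { dedupDesign }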
Venkat Malladi
@vsmalladi
Oct 18 2017 14:23
ah okay
then it will output into a particular file
perfect
Thanks @pditommaso
Paolo Di Tommaso
@pditommaso
Oct 18 2017 14:25
welcome
hope it works .. :)
Venkat Malladi
@vsmalladi
Oct 18 2017 14:26
thanks
cedrixic
@cedrixic
Oct 18 2017 15:53
Hi, quick (and maybe stupid) question: I want to write a process using Python. This process should output a bunch of files (in the Nextflow context). How do I specify within the Python script that the output of this script is a file? (In summary: what is the best option to replace a bash printf in Python? A simple open(), write and close?)
and maybe my question is not clear at all :)
Paolo Di Tommaso
@pditommaso
Oct 18 2017 15:56
exactly.. :)
cedrixic
@cedrixic
Oct 18 2017 15:57
ok :) so let's take the sample code bit on the nextflow website

process pyStuff {
    """
    #!/usr/bin/python
    x = 'Hello'
    y = 'world!'
    print "%s - %s" % (x,y)
    """
}
Paolo Di Tommaso
@pditommaso
Oct 18 2017 15:58
almost there
wrap the code in triple ` then new-line
cedrixic
@cedrixic
Oct 18 2017 15:59
yes
Paolo Di Tommaso
@pditommaso
Oct 18 2017 15:59
however that's just a basic example
cedrixic
@cedrixic
Oct 18 2017 15:59
and now, let's say I don't want to print stuff, but get output files from this process
it is :) but it's just to illustrate
Paolo Di Tommaso
@pditommaso
Oct 18 2017 16:00
your python code needs to save those files in the work dir
cedrixic
@cedrixic
Oct 18 2017 16:00
ok
Paolo Di Tommaso
@pditommaso
Oct 18 2017 16:01
that helps ?
cedrixic
@cedrixic
Oct 18 2017 16:02
let me try this, I'll let you know :)
thanks
Paolo Di Tommaso
@pditommaso
Oct 18 2017 16:02
:+1:
Evan Floden
@evanfloden
Oct 18 2017 16:03

I guess you want something like, in the script:
print "%s - %s" % (x,y) > myoutput.txt
and above this, in the output part of the process:
file('myoutput.txt') into pyOutCh

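Put together, a self-contained sketch of the whole process (names follow Evan's example; Python 2 syntax as in the snippet above; the doubled backslash keeps the newline escape out of Groovy's hands so it reaches Python):

process pyStuff {
    output:
    file 'myoutput.txt' into pyOutCh

    script:
    """
    #!/usr/bin/env python
    x = 'Hello'
    y = 'world!'
    # files written to the current directory land in the task work dir,
    # which is where Nextflow looks for the declared outputs
    with open('myoutput.txt', 'w') as out:
        out.write("%s - %s\\n" % (x, y))
    """
}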
cedrixic
@cedrixic
Oct 18 2017 16:04
indeed
cedrixic
@cedrixic
Oct 18 2017 16:18
awesome, everything works
Evan Floden
@evanfloden
Oct 18 2017 16:18
:+1:
cedrixic
@cedrixic
Oct 18 2017 16:18
@skptic and @pditommaso thanks for the help !
Venkat Malladi
@vsmalladi
Oct 18 2017 16:32
@cedrixic I have written all of my processes to call Python; hope to share a link soon
cedrixic
@cedrixic
Oct 18 2017 16:32
@vsmalladi sounds good!
Shawn Rynearson
@srynobio
Oct 18 2017 18:46
Quick Nextflow question: is there any level of modularization which, at run time, would allow you to choose which process (from a list) to include?
similar to scripting method calls.