These are chat archives for nextflow-io/nextflow

23rd
May 2018
Bioninbo
@Bioninbo
May 23 2018 06:56
Hello everyone,
When I use the collect operator, it seems my files are renamed. i.e. when I use: input: file("*.bw") from channel1.collect(), I get files 1.bw, 2.bw, 3.bw... Is this normal behavior? Can we prevent this renaming?
Paolo Di Tommaso
@pditommaso
May 23 2018 06:57
yes, it is
use file("*") if you want to keep the original file name
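[A minimal sketch of the two staging styles Paolo contrasts, assuming a channel named channel1 that emits .bw files; the process name is illustrative:]

```groovy
process useOriginalNames {
    input:
    // file("*.bw") would stage the collected files as 1.bw, 2.bw, 3.bw ...
    // file("*") stages each file under its original name instead
    file("*") from channel1.collect()

    script:
    """
    ls -l
    """
}
```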
Bioninbo
@Bioninbo
May 23 2018 07:00
Ah right, thanks for that Paolo!
Paolo Di Tommaso
@pditommaso
May 23 2018 07:00
:+1:
Luca Cozzuto
@lucacozzuto
May 23 2018 12:00
Hi all, using file("*") is giving me errors here
  output:
    file "annotation.events.ioe" into eventsFile_for_PSIcalc, eventsFile_for_DEanalysis
tool 
-i DataflowVariable(value=/nfs/software/bi/biocore_tools/git/nextflow/isoExpression/work/b8/99cbd2e662973aeacc42d78552e08f/annotation.events.ioe) -e DataflowVariable(value=/nfs/software/bi/biocore_tools/git/nextflow/isoExpression/work/47/24aae69c216deda0b190837fefa532/isoform_tpm.txt) -o ouptut_events
Paolo Di Tommaso
@pditommaso
May 23 2018 12:01
the input is not a file . .
maybe not even declared
Luca Cozzuto
@lucacozzuto
May 23 2018 12:02
?
I declared it with file ""
Paolo Di Tommaso
@pditommaso
May 23 2018 12:03
can you show the code? it's free .. ;)
Luca Cozzuto
@lucacozzuto
May 23 2018 12:04
/*
 * Make Alternative Splicing events from GTF
*/
process makeEvents {
    tag { annotation_file }
    label 'suppa'
    publishDir outputIndex

    input:
    file annotation_file

    output:
    file "annotation.events.ioe" into eventsFile_for_PSIcalc, eventsFile_for_DEanalysis

    script:
    """
         suppa.py generateEvents -p -f ioe -i ${annotation_file} -o annotation.events -e SE SS MX RI FL
        awk ' FNR==1 && NR!=1 { while (/^<header>/) getline; } 1 {print}' *.ioe > annotation.events.ioe    
    """
 }

/*
 * calculate PSI for each event
*/
process calcPSIforEvents {
    label 'suppa'
    publishDir outputEventsDE

    input:
    file ("*") from eventsFile_for_PSIcalc
    file ("*") from isoform_tpm_file_for_eventsPSIcalc

    output:
    file("ouptut_events.psi") into psi_file_for_splitting_events

    script:
    """
        suppa.py psiPerEvent -i ${eventsFile_for_PSIcalc} -e ${isoform_tpm_file_for_eventsPSIcalc} -o ouptut_events
    """
 }
error given is:
ERROR ~ Error executing process > 'calcPSIforEvents'

Caused by:
  Process `calcPSIforEvents` terminated with an error exit status (2)

Command executed:

  suppa.py psiPerEvent -i DataflowVariable(value=/nfs/software/bi/biocore_tools/git/nextflow/isoExpression/work/b8/99cbd2e662973aeacc42d78552e08f/annotation.events.ioe) -e DataflowVariable(value=/nfs/software/bi/biocore_tools/git/nextflow/isoExpression/work/47/24aae69c216deda0b190837fefa532/isoform_tpm.txt) -o ouptut_events
Paolo Di Tommaso
@pditommaso
May 23 2018 12:06
you are declaring
   input:
    file ("*") from eventsFile_for_PSIcalc
then in the script
script:
    """
        suppa.py psiPerEvent -i ${eventsFile_for_PSIcalc} ... 
    """
Luca Cozzuto
@lucacozzuto
May 23 2018 12:06
ouch
Paolo Di Tommaso
@pditommaso
May 23 2018 12:06
but eventsFile_for_PSIcalc is not the input file variable ..
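[For reference, the fix Paolo is pointing at: bind each staged input to a variable and interpolate that variable in the script, rather than the channel name. A sketch based on Luca's process; the variable names events and tpm are illustrative:]

```groovy
process calcPSIforEvents {
    label 'suppa'
    publishDir outputEventsDE

    input:
    file events from eventsFile_for_PSIcalc           // 'events' is the staged file,
    file tpm from isoform_tpm_file_for_eventsPSIcalc  // not the channel itself

    output:
    file("ouptut_events.psi") into psi_file_for_splitting_events

    script:
    """
    suppa.py psiPerEvent -i ${events} -e ${tpm} -o ouptut_events
    """
}
```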
Luca Cozzuto
@lucacozzuto
May 23 2018 12:06
yes
I see
sendivogius
@sendivogius
May 23 2018 12:07

Hi All,

I want to process (with awsbatch executor) file having () in name. When I run Nextflow I got error "syntax error near unexpected token `(". Inspecting command.run I can see:

# stage input files
rm -f 01_A\(R1\).fastq
rm -f 01_A\(R2\).fastq
aws s3 cp --only-show-errors s3://some_bucket/01_A(R1).fastq 01_A(R1).fastq
aws s3 cp --only-show-errors s3://some_bucket/01_A(R2).fastq 01_A(R2).fastq

So it looks like filenames are not escaped when copying from S3. Is there anything I can do to fix this (excluding input file renaming)?
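[The shell error comes from the unquoted parentheses in the generated aws s3 cp lines; quoting or escaping the path avoids it. A minimal reproduction with plain cp, using hypothetical file names:]

```shell
# Parentheses are special to the shell when unquoted, so a generated
# command like `cp 01_A(R1).fastq dest` fails with
# "syntax error near unexpected token `('". Quoting the path works:
mkdir -p /tmp/paren_demo && cd /tmp/paren_demo
touch '01_A(R1).fastq'
cp '01_A(R1).fastq' staged_R1.fastq   # quoted path: no syntax error
ls staged_R1.fastq
```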

Paolo Di Tommaso
@pditommaso
May 23 2018 12:07
:coffee: credits +1
:smile:
Luca Cozzuto
@lucacozzuto
May 23 2018 12:07
thanks!
Paolo Di Tommaso
@pditommaso
May 23 2018 12:08
@sendivogius oops, this looks like a bug, please open an issue reporting the above example
sendivogius
@sendivogius
May 23 2018 12:18
Sure, thanks!
Toni Hermoso Pulido
@toniher
May 23 2018 12:47
Hello, using 0.29.1 and a Univa (SGE) cluster, I see that queueSize seems to be ignored and many jobs are sent instead of the defined limit
example:
process {

    queue='mem_1tb,long-sl7,short-sl7'
    memory='24G'
    cpus='4'
    time='6h'
    scratch = true
    queueSize = '1'

    $buildIndex {
            queue='mem_1tb,long-sl7'
            time='24h'
            memory='60G'
            cpus='1'
    }
I would expect only 1 process by default sent to the cluster. What am I doing wrong?
Paolo Di Tommaso
@pditommaso
May 23 2018 12:48
queueSize goes into the executor scope
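[In other words, queueSize belongs in the executor scope of nextflow.config, not the process scope:]

```groovy
// nextflow.config
executor {
    name = 'sge'
    queueSize = 1   // at most 1 job submitted to the cluster at a time
}
```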
Toni Hermoso Pulido
@toniher
May 23 2018 12:49
ah
let me check..
Paolo Di Tommaso
@pditommaso
May 23 2018 12:49
no hurry :smile:
Toni Hermoso Pulido
@toniher
May 23 2018 12:50
so I imagine things like cpu are also being ignored...
cpus
Paolo Di Tommaso
@pditommaso
May 23 2018 12:50
no, that's fine; cpus, memory, etc. are defined at the process (task) level
queueSize is a config setting at infra level
does it make sense ?
Toni Hermoso Pulido
@toniher
May 23 2018 12:51
can I do a different queueSize for every process as well?
Paolo Di Tommaso
@pditommaso
May 23 2018 12:51
ummm
you are looking for maxForks
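[A sketch of a per-process parallelism limit with maxForks, using the 0.29-era $processName config selector from Toni's example:]

```groovy
// nextflow.config
process {
    $trimReads {
        maxForks = 2   // at most 2 parallel instances of trimReads
    }
}
```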
Toni Hermoso Pulido
@toniher
May 23 2018 12:52

executor {
name = 'sge'
queueSize = 1

$trimReads {
    queueSize = 2
}

}

Paolo Di Tommaso
@pditommaso
May 23 2018 12:52
nope
do you want to limit the max number of parallel executions of the same task?
Toni Hermoso Pulido
@toniher
May 23 2018 12:53
@pditommaso maxForks would work I think
Paolo Di Tommaso
@pditommaso
May 23 2018 12:53
exactly
Toni Hermoso Pulido
@toniher
May 23 2018 12:53
Yeah, I want to prevent some specific processes from running too many instances at the same time
because they affect negatively IO.
And other HPC users are affected...
Paolo Di Tommaso
@pditommaso
May 23 2018 12:54
I see
Toni Hermoso Pulido
@toniher
May 23 2018 12:55
I will try maxForks... thanks a lot
Paolo Di Tommaso
@pditommaso
May 23 2018 12:55
you are welcome
Toni Hermoso Pulido
@toniher
May 23 2018 12:58
Yep. It seems it's working. Thanks again :D
Paolo Di Tommaso
@pditommaso
May 23 2018 12:59
it *seems* :joy:
Toni Hermoso Pulido
@toniher
May 23 2018 12:59
It is working :D Sorry
Paolo Di Tommaso
@pditommaso
May 23 2018 12:59
kidding :D
Luca Cozzuto
@lucacozzuto
May 23 2018 14:12
@toniher welcome to the group of people bullied by @pditommaso! :)
Paolo Di Tommaso
@pditommaso
May 23 2018 14:14
I'm definitely too kind .. ! :joy:
Luca Cozzuto
@lucacozzuto
May 23 2018 14:28
well now it's my turn
if (params.filter != "NO")  {
    process mappingFilter { ... }
}
else {
    process mappingUnfiltered { ... }
}
this is not working anymore...
was there some change?
Paolo Di Tommaso
@pditommaso
May 23 2018 14:30
that's an extract .. I guess ?
Luca Cozzuto
@lucacozzuto
May 23 2018 14:30
yes
Paolo Di Tommaso
@pditommaso
May 23 2018 14:31
however, nothing has changed
Luca Cozzuto
@lucacozzuto
May 23 2018 14:31
I got this
ERROR ~ No such variable: process
Paolo Di Tommaso
@pditommaso
May 23 2018 14:31
so you know, then
Luca Cozzuto
@lucacozzuto
May 23 2018 14:32
one sec
if (params.filter != "NO")  {
    process mappingFilter {
        tag { pair_id }
        label 'suppa'
        publishDir outputMapping, mode: 'copy'

        input:
        file index from transcriptIndex
        set pair_id, file(readsA), file(readsB) from filtered_reads_for_mapping

        output:
        file "${pair_id}" into Aln_folders, Aln_folders_for_multiqc

        script:
        NGSaligner.mapPEWithSalmon( index, readsA, readsB, "${pair_id}", "ISF", task.cpus, "--minAssignedFrags 1") 
    }
} 
else {
    process mappingUnfiltered {
        publishDir outputMapping, mode: 'copy'
        label 'suppa'

        input:
        file index from transcriptIndex
        set pair_id, file(readsA), file(readsB) from reads_for_mapping.flatten().collate(3)

        output:
        file("${pair_id}") into Aln_folders, Aln_folders_for_multiqc

        script:
        NGSaligner.mapPEWithSalmon( index, readsA, readsB, "${pair_id}", "ISF", task.cpus) 

    }
}
Maxime Garcia
@MaxUlysse
May 23 2018 14:39
have you tried with when: params.filter != "NO" and when: params.filter == "NO"?
I'm pretty sure you can even make just one process with more intelligent design
Luca Cozzuto
@lucacozzuto
May 23 2018 14:41
I was thinking of changing it to use when (doing that right now) but this was a lazy experiment :)
Maxime Garcia
@MaxUlysse
May 23 2018 14:43
I'd try something like:
process mapping {
    tag { pair_id }
    label 'suppa'
    publishDir outputMapping, mode: 'copy'

    input:
    file index from transcriptIndex
    set pair_id, file(readsA), file(readsB) from filtered_reads_for_mapping

    output:
    file "${pair_id}" into Aln_folders, Aln_folders_for_multiqc

    script:
    option = params.filter != "NO" ? "--minAssignedFrags 1" : ""

    NGSaligner.mapPEWithSalmon( index, readsA, readsB, "${pair_id}", "ISF", task.cpus, "${option}")
}
Luca Cozzuto
@lucacozzuto
May 23 2018 14:44
I have also a conditional input
Maxime Garcia
@MaxUlysse
May 23 2018 14:44
Ah yes
Luca Cozzuto
@lucacozzuto
May 23 2018 14:44
filtered_reads_for_mapping or reads_for_mapping
Maxime Garcia
@MaxUlysse
May 23 2018 14:44
I'm pretty sure you can mix both of them into only one channel
Luca Cozzuto
@lucacozzuto
May 23 2018 14:45
not so easy since they are generated by two different steps..
        if (params.filter != "NO")  {
            set pair_id, file(readsA), file(readsB) from filtered_reads_for_mapping
        } else {
            set pair_id, file(readsA), file(readsB) from reads_for_mapping.flatten().collate(3)
        }
this doesn't seem to work either
Maxime Garcia
@MaxUlysse
May 23 2018 14:47

maybe before the process with something like:

newChannel = reads_for_mapping.flatten().collate(3).mix(filtered_reads_for_mapping)

and then

set pair_id, file(readsA), file(readsB) from newChannel
in the newly made process
you can definitely mix a non-empty channel with an empty one
Luca Cozzuto
@lucacozzuto
May 23 2018 14:48
reads_for_mapping and filtered_reads_for_mapping are mutually exclusive
you have either one or the other
Maxime Garcia
@MaxUlysse
May 23 2018 14:49
That seems logical, but you should still be able to mix them
in one case you'll have the first channel empty mixed with the second, so only the second, and vice versa
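[The pattern Maxime describes: exactly one of the two upstream processes runs, so its counterpart channel is simply empty, and mixing them yields whichever one has items. A sketch using the channel names from the discussion:]

```groovy
// Only one of these channels will emit items; the other stays empty,
// so mix() effectively selects the active one.
all_reads_for_mapping = reads_for_mapping
    .flatten()
    .collate(3)
    .mix(filtered_reads_for_mapping)

// then a single mapping process can declare:
//   set pair_id, file(readsA), file(readsB) from all_reads_for_mapping
```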
Luca Cozzuto
@lucacozzuto
May 23 2018 14:49
I'll give it a try
thanks anyway for your help! :)
Maxime Garcia
@MaxUlysse
May 23 2018 14:50
We relied a lot on if/else blocks at the beginning, but rapidly switched to using when: as soon as possible
It really makes your code clearer
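[A sketch of the when: style Maxime recommends, applied to one of Luca's two processes; a process whose when: condition is false is simply skipped:]

```groovy
process mappingFilter {
    // runs only when the condition holds; otherwise the process is skipped
    when:
    params.filter != "NO"

    input:
    file index from transcriptIndex
    set pair_id, file(readsA), file(readsB) from filtered_reads_for_mapping

    output:
    file "${pair_id}" into Aln_folders

    script:
    NGSaligner.mapPEWithSalmon( index, readsA, readsB, "${pair_id}", "ISF", task.cpus, "--minAssignedFrags 1")
}
```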
Luca Cozzuto
@lucacozzuto
May 23 2018 14:50
it is true
Maxime Garcia
@MaxUlysse
May 23 2018 14:53
And you can do magic with channels to optimize your process
Luca Cozzuto
@lucacozzuto
May 23 2018 14:54
?
Maxime Garcia
@MaxUlysse
May 23 2018 14:54
Like if you manage to merge your two processes into just one
micans
@micans
May 23 2018 16:14
Hello, new NF user here. Is this the right place to ask a question?
Paolo Di Tommaso
@pditommaso
May 23 2018 16:31
I guess so
micans
@micans
May 23 2018 16:35
Thanks ... I work for Vlad. My questions may be basic or stupid. I love NF so far, great philosophy, great execution.
Paolo Di Tommaso
@pditommaso
May 23 2018 16:36
You are welcome
micans
@micans
May 23 2018 16:36
The question is this: at the end of one of our pipelines I need to tar a number of cram files. Currently I use a publishDir directive to copy the result. However, the tar file may be on the order of a terabyte. I assume this approach ends up with two large files; is that true, and is there a way to do it only once?
Paolo Di Tommaso
@pditommaso
May 23 2018 16:43
If your storage supports (hard) links, that would be the best solution; otherwise specify move as the publish option
micans
@micans
May 23 2018 16:46
Ah yes of course. link and move options to publishDir. Well I should have figured that out. Thanks!
Paolo Di Tommaso
@pditommaso
May 23 2018 16:48
well, I mean link *or* move
by default publishDir creates a symbolic link in the target directory
micans
@micans
May 23 2018 16:49
Yes sorry I got that. I mean they are both options to publishDir
Paolo Di Tommaso
@pditommaso
May 23 2018 16:50
if that does not work for you, you can use a hard link, which does not break even if you delete the source file, but it may not work on your storage
alternatively you can move the file; as a side effect it will re-execute the last step if you resume the execution
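[The two alternatives Paolo mentions, expressed as publishDir modes (outputDir is a placeholder for the target directory):]

```groovy
publishDir outputDir, mode: 'link'   // hard link: no extra copy, survives deleting the work dir
// or
publishDir outputDir, mode: 'move'   // moves the file; forces re-run of this step on -resume
```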
micans
@micans
May 23 2018 16:55
The move is not across file systems, so it should be safe and fast. I think the side effect you mention is a good one.
Paolo Di Tommaso
@pditommaso
May 23 2018 16:55
:+1: