These are chat archives for nextflow-io/nextflow

2nd
Jun 2017
chdem
@chdem
Jun 02 2017 08:33

Hi there! I'm still stuck on my problem. I can't find anything clean (even if @pditommaso's last solution is better than my first idea):

  • firstly, my conditions are not exclusive, because the user can use one, two or three parameters, depending on what data is available (params.fastq_path, params.bam_path and params.vcf_path). This is not if (condition) else but something like:

    if (params.fastq_path){
      process fastQC{
          input:
          set val(prefix), file(fastq_file) from fastq_files
    
          output:
          set file("*.zip"), file("*.html") into fastqc_output
    
          script:
          """
          fastqc -t ${task.cpus} ${fastq_file}
          """
      }
    }
    if (params.bam_path){
      process PicardToolsMarkDuplicates{
          input:
          set val(prefix), file(bam_file) from bam_files1
    
          output:
          file("${prefix}_marked_dup_metrics.txt") into marked_dup_output
    
          """
          java -Xmx${task.memory.toGiga()}g \
          -jar \$PICARD_HOME \
          MarkDuplicates \
          I=${bam_file} \
          O=${prefix}_duplicates.bam \
          M=${prefix}_marked_dup_metrics.txt
          """
      }
    
      process PicardToolsCollectInsertSizeMetrics{
          input:
          set val(prefix), file(bam_file) from bam_files2
    
          output:
          file("${prefix}_insert_size_metrics.txt") into insert_size_metrics_output
    
          """
          java -Xmx${task.memory.toGiga()}g \
          -jar \$PICARD_HOME \
          CollectInsertSizeMetrics \
          I=${bam_file} \
          O=${prefix}_insert_size_metrics.txt \
          H=${prefix}_insert_size_histogram.pdf
          """
      }
    }
    if (params.vcf_path){
      process BcfToolsStats{
          input:
          set val(prefix), file(vcf_file) from vcf_files1
    
          output:
          file("${prefix}_bcftools.stats") into bcftools_stats_output
    
          script:
          """
          bcftools stats ${vcf_file} > ${prefix}_bcftools.stats
          """
      }
    
      process SnpEff {
          input:
          set val(prefix), file(vcf_file) from vcf_files2
    
          output:
          set file("*.csv"), file("${prefix}_snpeff.vcf") into snpeff_vcf_output
    
          script:
          """
          java -Xmx${task.memory.toGiga()}g \
          -jar \$SNPEFF_HOME \
          -csvStats ${prefix}.csv \
          GRCh37.75 \
          ${vcf_file} > ${prefix}_snpeff.vcf
          """
      }
    }
    process MultiQC {
      input:
      if (params.fastq_path){
          file ('') from fastqc_output.collect()
      }
      if (params.bam_path){
          file ('') from marked_dup_output.collect()
          file ('') from insert_size_metrics_output.collect()
      }
      if (params.vcf_path){
          file ('') from gatk_variant_eval_output.collect()
          file ('') from bcftools_stats_output.collect()
          file ('') from snpeff_vcf_output.collect()
      }
    output:
      file "*multiqc_report.html"
      file "*multiqc_data"
    
      script:
      """
      multiqc -f ${results_outdir}
      """
    }

    That's why I can't use a single downstream process :worried:

  • secondly, because I can't rely on any channel existing, I don't see how to use the .mix operator, or rather, I don't know which channels I could apply .mix to
Maxime Garcia
@MaxUlysse
Jun 02 2017 08:54
@chdem Have you tried something like that: https://pastebin.com/arJcnUPJ
I do think you can use mix that way
We're doing something similar in CAW
chdem
@chdem
Jun 02 2017 08:57
thank you @MaxUlysse for the GitHub complement, pastebin is blocked in the hospital where I work :+1:
The idea is to use when statements in your processes instead of if statements wrapped around your processes
That way all your channels will be there (but some empty)
so no problem with .mix
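For the archive (the pastebin above may be inaccessible), here is a minimal sketch of that when: pattern, reusing the channel names from the question. This assumes all input channels are always created, possibly empty, so that every output channel exists:

```nextflow
// Sketch only: the process is always declared, so fastqc_output always
// exists; `when:` just skips execution if the parameter is not set.
process fastQC {
    input:
    set val(prefix), file(fastq_file) from fastq_files

    output:
    set file("*.zip"), file("*.html") into fastqc_output

    when:
    params.fastq_path

    script:
    """
    fastqc -t ${task.cpus} ${fastq_file}
    """
}

// Downstream, channels can be mixed unconditionally: a skipped
// process simply leaves its output channel empty.
multiqc_input = fastqc_output.mix(marked_dup_output, bcftools_stats_output).collect()
```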
chdem
@chdem
Jun 02 2017 09:01
WONDERFUL !
You make my day !
this solution is sexy !
by the way, I really love your work on the CAW workflow, your script is going to help me a lot, I mean especially patients !
Maxime Garcia
@MaxUlysse
Jun 02 2017 09:04
Thanks a lot, that's very nice ;-)
I'm hoping it'll work without problems ;-)
chdem
@chdem
Jun 02 2017 09:12
For everyone here: I've been thinking about it for a while and never took the time to tell you, but Nextflow completely changed our way of working here at the hospital! We are better at traceability, better at script organisation, better at development cycles... with Singularity and Slurm, it really revolutionized our way of working (even though we've been using Nextflow for barely a few months)... We still have some more work to do, but we progress faster than before!
chdem
@chdem
Jun 02 2017 09:27
There is a big "Snakemake lobby" here in France, but in my opinion that's mainly because Nextflow is not as widely known... and it's a shame!
Thanh Lê
@thanhleviet
Jun 02 2017 09:28
How can I flatten a data channel but still keep the file names instead of numeric ones?
chdem
@chdem
Jun 02 2017 09:28
So thanks to the Nextflow community for this !!!
@thanhleviet I don't think flatten is supposed to change filenames into numbers... can you give an example?
Thanh Lê
@thanhleviet
Jun 02 2017 09:31
okay, I set an output of files, say file_S1_B1, file_s2_B2, into assembly_output
in the next process, I want to group all these files as a single input, like file_*, and run it once, not separately for file_S1_B1 and then for file_s2_B2
so I set assembly_output.flatten().toList()
and I got something like file_1, file_2
But I need to keep the real file names
chdem
@chdem
Jun 02 2017 09:34
I think this is because Nextflow does not recognize your filenames
can you show your first process's input and your last process's output ?
Thanh Lê
@thanhleviet
Jun 02 2017 09:36
First process

process SpadesAssembly {
publishDir "${params.out_dir}/SPadesContigs", mode: "copy"

tag { dataset_id }

input:
set dataset_id, file(forward), file(reverse) from spades_read_pairs

output:
set dataset_id, file("${dataset_id}_spades-contigs.fa") into (spades_assembly_results, spades_assembly_quast_contigs)

"""
spades.py --pe1-1 ${forward} --pe1-2 ${reverse} -t ${threads} -o spades_output
mv spades_output/contigs.fasta ${dataset_id}_spades-contigs.fa
"""

}

The following process:
process resfinder {
    publishDir "${params.out_dir}/Resitance", mode: "copy"

    // tag {dataset_id}

    input:
    file("*.fa") from spades_assembly_results.flatten().toList()
    output:
    file "resfinder.tsv" into resfinder

    """
    abricate -minid=90 *.fa > resfinder.tsv
    """
}
chdem
@chdem
Jun 02 2017 09:37
you can't do that: spades_assembly_results contains tuples of values and files
in the resfinder input, you have to keep this in mind
Thanh Lê
@thanhleviet
Jun 02 2017 09:39
Well, it worked, but the output in resfinder was 1.fa, 3.fa, 5.fa
I get it, because spades_assembly_results also had the values
chdem
@chdem
Jun 02 2017 09:40
yes
I'm trying to find the code that could help you
Thanh Lê
@thanhleviet
Jun 02 2017 09:41
thank you in advance :)
chdem
@chdem
Jun 02 2017 09:42
first : do you need to keep the dataset_id in your spades_assembly_results ?
output:
file("${dataset_id}_spades-contigs.fa") into (spades_assembly_results, spades_assembly_quast_contigs)
Thanh Lê
@thanhleviet
Jun 02 2017 09:43
yes, I do need
chdem
@chdem
Jun 02 2017 09:43
ok
I think the solution is in spades_assembly_results.map{it2 -> it2[1]}
Thanh Lê
@thanhleviet
Jun 02 2017 10:06
it does not work, as 2 resfinder processes were launched
it should be one: abricate *.fa > my_outcome.txt
Thanh Lê
@thanhleviet
Jun 02 2017 10:25
@chdem I found a solution. It’s much simpler
process resfinder {
publishDir "${params.out_dir}/Resitance", mode: "copy"
    // tag {dataset_id}

    input:
    file assembled from spades_assembly_results.toList()

    output:
    file "resfinder.tsv" into resfinder

    """
    abricate -minid=90 $assembled > resfinder.tsv
    """
}
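For the record, the earlier .map suggestion can be combined with .toList() if the dataset_id also has to be stripped from each tuple first — a sketch only, assuming spades_assembly_results emits (dataset_id, file) tuples:

```nextflow
process resfinder {
    publishDir "${params.out_dir}/Resitance", mode: "copy"

    input:
    // keep only the file from each (dataset_id, file) tuple,
    // then gather all files into a single task invocation
    file assembled from spades_assembly_results.map { it[1] }.toList()

    output:
    file "resfinder.tsv" into resfinder

    """
    abricate -minid=90 $assembled > resfinder.tsv
    """
}
```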
glob patterns versus a single variable holding all the files always confuse me
:smile:
chdem
@chdem
Jun 02 2017 10:43
Great !! :+1:
Paolo Di Tommaso
@pditommaso
Jun 02 2017 12:00
I've been thinking about it for a while and never took the time to tell you, but Nextflow completely changed our way of working here at the hospital! We are better at traceability, better at script organisation, better at development cycles... with Singularity and Slurm, it really revolutionized our way of working (even though we've been using Nextflow for barely a few months)... We still have some more work to do, but we progress faster than before!
Wow, that's a very bold statement! thanks a lot @chdem
out of curiosity, what's your organisation ?
chdem
@chdem
Jun 02 2017 12:01
:smile:
Public hospital in North of France : CHRU de Lille
thank YOU @pditommaso and thanks to the community, truly !
Paolo Di Tommaso
@pditommaso
Jun 02 2017 12:03
Nice, thanks a lot for your kind words, I'm extremely proud that NF is helping in the field
chdem
@chdem
Jun 02 2017 12:03
you can be proud ! It's helping people ! Indirectly, but it's helping for sure !
Paolo Di Tommaso
@pditommaso
Jun 02 2017 12:04
That's the very reason I've started to work in bioinformatics, happy I'm not wasting my time :D
chdem
@chdem
Jun 02 2017 12:05
The team would like to be at the Nextflow Workshop in september but it's difficult for many reasons....do you plan to organize other workshop ?
Paolo Di Tommaso
@pditommaso
Jun 02 2017 12:07
not planned yet, but surely there will be another one in 2018
chdem
@chdem
Jun 02 2017 12:07
one per year ?
Paolo Di Tommaso
@pditommaso
Jun 02 2017 12:08
well, organising such events takes a lot of time
chdem
@chdem
Jun 02 2017 12:08
sure
can understand that...
Shellfishgene
@Shellfishgene
Jun 02 2017 12:10
Beginner question again: I have a channel with sample names, and want the input of my process to use files such as sampleA/sampleA_contigs.fa. Do I use the dynamic file name example from the docs, or is there an easier way?
Paolo Di Tommaso
@pditommaso
Jun 02 2017 12:12
do you want to keep the original file names, right ?
Shellfishgene
@Shellfishgene
Jun 02 2017 12:14
Not sure what you mean? The question is: do I need to make a file object from the sample ID string in the channel, or do I just have val sample_id from samples_ch and then use ${sample_id}/${sample_id}_contigs.fa in my process?
Paolo Di Tommaso
@pditommaso
Jun 02 2017 12:15
input files need to be handled by declaring the file type; there are two main ways:
1) by using a variable reference, e.g.
input: 
file x from samples_ch

"""
my_command --input $x
"""
2) by providing a static file name
input: 
file 'my-contings.fa' from samples_ch

"""
my_command --input my-contings.fa
"""
Shellfishgene
@Shellfishgene
Jun 02 2017 12:18
Right, but I have samples_ch = Channel.from( 'mi', 'vf' ) and no file names yet.
Paolo Di Tommaso
@pditommaso
Jun 02 2017 12:18
in the latter example, samples are staged in the task work dir with the name my-contings.fa, whatever the name of the sample file is
I see, but if your process needs to process a file, you need to give a file to it
you cannot compose the file path in the process
Shellfishgene
@Shellfishgene
Jun 02 2017 12:20
Right, so my question was can I turn the string into a file in the process, like here
Paolo Di Tommaso
@pditommaso
Jun 02 2017 12:20
so, if I've understood well, you will need to compose the target path by using a map operator
for example
samples_ch = Channel.from( 'mi', 'vf' ).map { name -> file("/some/base/path/${name}.fa") }
then use samples_ch in the process
does it make sense ?
Shellfishgene
@Shellfishgene
Jun 02 2017 12:22
Ah, ok. I guess a wildcard would be simple, too, like */*_contig.fa. But then I would "lose" the sample names, like mi/mi_contigs.fa, for use later in the pipeline, no?
hmm, gitter is eating my *
Paolo Di Tommaso
@pditommaso
Jun 02 2017 12:24
In that case the NF way to handle this is
samples_ch  = Channel.fromPath("some/data/wildcards-here.fa").map { file -> return [ getPrefixFrom(file), file ] }

process foo {
   input:
  set sampleId, file(sampleFile) from samples_ch

  """
   command_here .. 
  """
}
where getPrefixFrom is a custom function that given a file returns the sample name/id
in the most basic form
def getPrefixFrom( file ) {
  return file.getName()
}
for example
makes sense ?
Shellfishgene
@Shellfishgene
Jun 02 2017 12:28
Yes, great, thanks. I'll just use a regex to get my ID I guess.
Paolo Di Tommaso
@pditommaso
Jun 02 2017 12:28
exactly
Shellfishgene
@Shellfishgene
Jun 02 2017 12:29
And I can use the same method to get the IDs further down the pipeline by adding them to the output, too I guess?
Paolo Di Tommaso
@pditommaso
Jun 02 2017 12:29
of course, that's just a function that you can reuse as you need
Shellfishgene
@Shellfishgene
Jun 02 2017 12:30
thanks!
Paolo Di Tommaso
@pditommaso
Jun 02 2017 12:30
welcome
Jean-Christophe Houde
@jchoude
Jun 02 2017 13:20
hi nextflow experts, does anyone know if there is a minimum version of Torque required to work correctly with nextflow?
Paolo Di Tommaso
@pditommaso
Jun 02 2017 13:20
I don't think so
do you have a specific problem ?
Jean-Christophe Houde
@jchoude
Jun 02 2017 13:22
Didn't try it, I was interacting with our cluster manager before trying to launch anything, to make sure it was compatible. He just told me that the currently installed version is quite old (version 2), and that he feared it might cause issues
But I just wanted to see if anyone had had any issues with that before launching jobs and possibly breaking something
Thanks for the info!
Paolo Di Tommaso
@pditommaso
Jun 02 2017 13:24
NF just submits jobs by using qsub
using the -N -o -l -j options, quite standard
Jean-Christophe Houde
@jchoude
Jun 02 2017 13:25
ok cool. Thanks
Paolo Di Tommaso
@pditommaso
Jun 02 2017 13:25
You will need to use the pbs executor
Jean-Christophe Houde
@jchoude
Jun 02 2017 13:25
Yep that's what I saw in the documentation.
I'll give it a try in the coming days.
Thanks for the tool, it is quite amazing!
Paolo Di Tommaso
@pditommaso
Jun 02 2017 13:26
thank you, that's very good for my karma :D
chdem
@chdem
Jun 02 2017 13:28
Hello! I've got this error (testing @MaxUlysse's solution):
Launching `/mnt/SeqHD_ISILON/BIO_INFO/TEAM_bioinfo_CBP/DEV/tools/multiqc/multiQC.nf` [admiring_stallman] - revision: 6376f6e0d5
ERROR ~ General error during conversion: Index: 5, Size: 5

java.lang.IndexOutOfBoundsException: Index: 5, Size: 5
        at java.util.ArrayList.rangeCheck(ArrayList.java:635)
        at java.util.ArrayList.get(ArrayList.java:411)
        at nextflow.ast.NextflowDSLImpl.convertProcessBlock(NextflowDSLImpl.groovy:272)
        at nextflow.ast.NextflowDSLImpl.convertProcessDef(NextflowDSLImpl.groovy:770)
        at nextflow.ast.NextflowDSLImpl$1.visitMethodCallExpression(NextflowDSLImpl.groovy:109)
        at org.codehaus.groovy.ast.expr.MethodCallExpression.visit(MethodCallExpression.java:66)
        at org.codehaus.groovy.ast.CodeVisitorSupport.visitExpressionStatement(CodeVisitorSupport.java:71)
what could it be ?
no more error if I delete the when: statements
strange...
ok, got it. This is BAD :
when: 
fastq_path
this is GOOD :
when: fastq_path
Paolo Di Tommaso
@pditommaso
Jun 02 2017 13:35
umm, weird
Maxime Garcia
@MaxUlysse
Jun 02 2017 13:35
What did I do this time ;-) ?
chdem
@chdem
Jun 02 2017 13:36
not you @MaxUlysse lol
it's me ;)
chdem
@chdem
Jun 02 2017 13:41
Oops, I realized what it was... OK, it was my mistake... sorry...
Maxime Garcia
@MaxUlysse
Jun 02 2017 13:48
What was the problem ?
Shellfishgene
@Shellfishgene
Jun 02 2017 13:49
Can you spot what's wrong here? I always get "null" for the sample name, but the regex seems fine.
def getPrefixFrom( file ) {
  def pattern = ~/.+(..)_contig.fa/
  def m = pattern.matcher(file.getName())
  if ( m.find() ) {
        return m.group(1)
  }
}

samples_ch  = Channel.fromPath("*/*_contig.fa").map { file -> return [ getPrefixFrom(file), file ] }
samples_ch.subscribe { println "$it" }
The result is [null, /sfs/fs3/work-geomar7/user/mnops_ms/mi/mi_contig.fa]
Is the third line ok, with file.getName()?
Shellfishgene
@Shellfishgene
Jun 02 2017 13:56
Figured it out, getName returns only the name not the path. Duh.
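A corrected version of the function along those lines, matching against the full path instead of the bare file name (a sketch only):

```nextflow
def getPrefixFrom( file ) {
    // use the full path: getName() strips the directory part,
    // which made the original pattern fail to match
    def m = ( file.toString() =~ /.+\/(..)_contig\.fa/ )
    return m ? m[0][1] : null
}
```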
Shellfishgene
@Shellfishgene
Jun 02 2017 15:04
Can I change or add variables in the process block? For example I get a sample id like so: set sample_id, file(contig_file) from samples_ch. I also need this sample id as uppercase for the command. Can I put something like def id_uc = sample_id.toUpperCase() somewhere?
Phil Ewels
@ewels
Jun 02 2017 15:19
Yes, in the script block
After script: but before the """
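For example (a sketch using the variable names from the question; the process and command names are made up):

```nextflow
process runWithUppercaseId {
    input:
    set sample_id, file(contig_file) from samples_ch

    script:
    def id_uc = sample_id.toUpperCase()   // plain Groovy, between script: and the """
    """
    my_command --sample ${id_uc} --contigs ${contig_file}
    """
}
```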
Shellfishgene
@Shellfishgene
Jun 02 2017 15:21
ah, didn't see that in the docs, thanks
A more general question, maybe I should do it differently: I use platanus to assemble in 3 steps: assemble with fastq R1/R2 > contig.fa, scaffold with contig.fa and fastq R1/R2 > scaffold.fa, gapfill with scaffold.fa and fastq R1/R2.
So the fastq files are used in all three steps, which means I can't get them from a single channel, right? Do I just point to their path in the command?
Phil Ewels
@ewels
Jun 02 2017 15:33
You just need to make three copies of the channel
E.g. channel.into {fq1, fq2, fq3}
(FastQ files go to FastQC and trimming)
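For the three platanus steps, that could look something like this (hypothetical channel names):

```nextflow
// one source channel duplicated into three copies,
// one per assembly step that re-reads the fastq files
Channel
    .fromFilePairs( params.reads )
    .into { reads_assemble; reads_scaffold; reads_gapfill }
```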
Shellfishgene
@Shellfishgene
Jun 02 2017 15:40
Ok, makes sense. I'm somehow still confused about how to make sure the right files go together when using input from multiple channels, is it just based on the order?
Right now I have this, but I'm not sure if it's necessary:
set sample_id, file(contig_file) from samples_ch
file "${sample_id}_contigBubble.fa" from bubbles_ch
Shellfishgene
@Shellfishgene
Jun 02 2017 15:48
@ewels Why do you have collect in this line in your workflow: file index from star_index.collect()? Is that not just one file in that channel?
Phil Ewels
@ewels
Jun 02 2017 18:44

> files go together

How do you mean?

If you have multiple channels that need to be tied together then you need to do more clever stuff
For the star index, it is yes. But if not supplied by the user then the pipeline can generate it
So it has to be handled as a channel for the process that it goes into, in case it's dynamically generated and produced as a channel
It used to be just a file before we added that :)
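As an aside, .collect() also turns the single-item queue channel into a value-like channel that can be read by every task instead of being consumed by the first one — a sketch of the idiom, with hypothetical process and channel names:

```nextflow
process star_align {
    input:
    file index from star_index.collect()          // reusable across all samples
    set sample_id, file(reads) from trimmed_reads

    """
    STAR --genomeDir ${index} --readFilesIn ${reads}
    """
}
```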