These are chat archives for nextflow-io/nextflow

1st
Aug 2017
Jakob Willforss
@Jakob37
Aug 01 2017 06:48
Is it possible to organize the source code over multiple different files? For example, if I have a set of related processes performing a limited task which I would like to reuse as a sub-step in a more general workflow.
Maxime Garcia
@MaxUlysse
Aug 01 2017 10:09
@pditommaso Thanks a lot for the SNAPSHOT, it fixes everything
And by the way, I forgot to tell you: I managed to run CAW with Singularity containers, but I only tried our simple test data. I'm waiting for our clusters to have a more recent version of Nextflow to fully use Singularity containers.
Thanks a lot for all ;-)
Paolo Di Tommaso
@pditommaso
Aug 01 2017 12:53
@Jakob37 It depends, have a look here
@MaxUlysse Nice, I think the NF community would benefit a lot from knowing more about this use case. Would you like to write a very short blog post about that?
Maxime Garcia
@MaxUlysse
Aug 01 2017 13:01
@pditommaso With pleasure, and I do think it could be interesting too
Paolo Di Tommaso
@pditommaso
Aug 01 2017 13:07
that would be great, I would be happy to host your post in the NF blog or to cross post if you have your own blog
Maxime Garcia
@MaxUlysse
Aug 01 2017 13:12
I do have my own website and blog, but I have more or less given up on it and haven't taken enough time to work on a new one, so I'll opt for the NF blog
Paolo Di Tommaso
@pditommaso
Aug 01 2017 13:38
Cool, looking forward to reading it. Really!
Jeff Jasper
@_jasper1918_twitter
Aug 01 2017 13:45
Just wrote my first pipeline using nextflow and was curious about best practices for using existing outputs. When I use publishDir to an EFS mount and wipe out the cache, the pipeline starts from scratch when I rerun. If I want the pipeline to evaluate existing outputs, what is the preferred approach? Do I have to save the cache via storeDir, and how is this affected by copy vs move?
Thinking of this from an AWS perspective btw
Paolo Di Tommaso
@pditommaso
Aug 01 2017 13:46
my suggestion is to store final results to S3 via publishDir and clean up the shared work dir on EFS
since it may be expensive
Jeff Jasper
@_jasper1918_twitter
Aug 01 2017 13:47
Expensive on EFS absolutely, but how can I get around rerunning the entire pipeline from scratch if I need to modify just the last process in that model?
Paolo Di Tommaso
@pditommaso
Aug 01 2017 13:48
um, in that case you keep the temporary data on EFS
another alternative is to run everything on S3
Jeff Jasper
@_jasper1918_twitter
Aug 01 2017 13:49
so temporary as in storeDir is essential?
Paolo Di Tommaso
@pditommaso
Aug 01 2017 13:49
forget storeDir, especially if you want to use it over S3
if you want to save money keep your temporary data on S3
Jeff Jasper
@_jasper1918_twitter
Aug 01 2017 13:53
So in that case use scratch on EBS mounts and have work directory and publishDir on S3?
Paolo Di Tommaso
@pditommaso
Aug 01 2017 13:56
yes, provided you mount and specify the scratch path
NF does not (yet) mount EBS automatically, it uses the instance ephemeral storage
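The setup discussed above (work directory on S3, task scratch on a manually mounted volume) could be sketched in `nextflow.config` roughly as follows. The bucket name and mount point are hypothetical placeholders, and whether an S3 work directory is usable depends on the executor in use, so treat this as an assumption-laden sketch rather than a tested configuration.

```groovy
// nextflow.config: illustrative values only

// Hypothetical bucket; intermediate task data (the cache used by -resume) lives here.
workDir = 's3://my-bucket/work'

process {
    // Hypothetical mount point of a volume you attached and mounted yourself;
    // tasks run in a temporary directory under this path.
    scratch = '/mnt/ebs/scratch'
}
```

Final results would still be copied out by a `publishDir` directive pointing at an S3 path in the relevant processes.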
Jeff Jasper
@_jasper1918_twitter
Aug 01 2017 13:59
Gotcha. I will give that a go. Really have enjoyed using NF in comparison to many other popular workflow managers and see lots of potential. Keep up the great work!
Paolo Di Tommaso
@pditommaso
Aug 01 2017 13:59
I like comments like this :D
Brian Reichholf
@breichholf
Aug 01 2017 15:22
I'm having difficulties tracking down an error, again. Sorry to be a bother 😔
I've provided two input files for a workflow, but for some reason that I can't decipher one file does not go through all processes, but the other one does.
Paolo Di Tommaso
@pditommaso
Aug 01 2017 15:24
is it a runtime error or an execution logic error?
Brian Reichholf
@breichholf
Aug 01 2017 15:25
It's not reporting any runtime errors, so I'm guessing execution logic
Paolo Di Tommaso
@pditommaso
Aug 01 2017 15:26
I see, what are you expecting? And what's happening?
Brian Reichholf
@breichholf
Aug 01 2017 15:28
I have 3 processes. File 1 passes through all 3, file 2 does not go into process bowtie_hairpins
process trim_adapter {
  tag "$reads"

  input:
  file reads from rawReads

  output:
  file "*.adapter_clipped.fq.gz" into adapterClipped
  file "*.trim_report.txt" into trimResults

  script:
  prefix = reads.toString() - ~/(\.fq)?(\.fastq)?(\.gz)?$/
  """
  cutadapt -m 26 -M 38 -a $adapter -o ${prefix}.adapter_clipped.fq.gz $reads > ${prefix}.trim_report.txt
  """
}

process trim_4N {
  tag "$acReads"

  input:
  file acReads from adapterClipped

  output:
  file "*.trimmed.fq.gz" into trimmedReads

  script:
  prefix = acReads.toString() - ".adapter_clipped.fq.gz"
  """
  seqtk trimfq \\
    -b 4 \\
    -e 4 \\
    $acReads > trimmedReads.fq

  gzip -c trimmedReads.fq > ${prefix}.trimmed.fq.gz
  """
}

process bowtie_hairpins {
  tag "$trimmedReads"

  publishDir "${params.outdir}/bowtie/ext_hairpins", mode: "copy", pattern: '*.TCtagged_hairpin.bam'

  input:
  file trimmedReads
  file index from hairpinIndex

  output:
  file "${prefix}.TCtagged_hairpin.bam" into hairpinAligned

  script:
  index_base = index.toString().tokenize(' ')[0].tokenize('.')[0]
  prefix = trimmedReads.toString() - ~/(_trimmed)?(\.fq)?(\.fastq)?(\.gz)?$/
  """
  echo "${trimmedReads}" >> postBowtie.log
  bowtie  -a --best --strata -v $mismatches -S -p ${task.cpus} $index_base -q <(zcat $trimmedReads) | \
    samtools view -bS - > ${prefix}.TCtagged_hairpin.bam
  """
}
Paolo Di Tommaso
@pditommaso
Aug 01 2017 15:29
what are File 1 and File 2?
Brian Reichholf
@breichholf
Aug 01 2017 15:30
rawReads is set as
Channel
  .fromPath( params.reads )
  .ifEmpty { error "Cannot find any reads matching: ${params.reads}" }
  .set { rawReads }
Paolo Di Tommaso
@pditommaso
Aug 01 2017 15:30
that's fine
Brian Reichholf
@breichholf
Aug 01 2017 15:32
From what I can tell, the file that gets through trim_adapter and trim_4N first (probably simply due to size) gets processed in bowtie_hairpins, and the other doesn't.
Paolo Di Tommaso
@pditommaso
Aug 01 2017 15:32
I don't understand what you mean by
File 1 passes through all 3, file 2 does not go into process bowtie_hairpins
Brian Reichholf
@breichholf
Aug 01 2017 15:34
Ah, sorry. My expectation would be that anything in rawReads will end up going through trim_adapter, then trim_4N, and finally bowtie_hairpins.
However, this is not the case: both (initial) input files are submitted to trim_adapter and then trim_4N, but only the one that (apparently) finishes trim_4N first is used as input for bowtie_hairpins.
Paolo Di Tommaso
@pditommaso
Aug 01 2017 15:36
trim_4N receives only what is declared as its input, i.e. adapterClipped
is that fine?
Brian Reichholf
@breichholf
Aug 01 2017 15:36
Yes.
Paolo Di Tommaso
@pditommaso
Aug 01 2017 15:37
so the problem is with bowtie_hairpins ?
Brian Reichholf
@breichholf
Aug 01 2017 15:38
It seems so
Paolo Di Tommaso
@pditommaso
Aug 01 2017 15:38
ok, here I see
  input:
  file trimmedReads
  file index from hairpinIndex
trimmedReads, which is produced by trim_4N
but I don't see hairpinIndex in that code
is it defined somewhere?
Brian Reichholf
@breichholf
Aug 01 2017 15:39
Yeah, it's a generic hairpin index building process.
process makeIndex {
  tag "Hairpins"

  input:
  file hairpinFasta

  output:
  file "hairpin_idx*" into hairpinIndex

  script:
  """
  samtools faidx hairpin.fa
  bowtie-build ${hairpinFasta} hairpin_idx
  """
}
Paolo Di Tommaso
@pditommaso
Aug 01 2017 15:40
and this is executed only once to create the index, right?
Brian Reichholf
@breichholf
Aug 01 2017 15:40
Yep, exactly
Paolo Di Tommaso
@pditommaso
Aug 01 2017 15:41
so what's the problem? :D
Brian Reichholf
@breichholf
Aug 01 2017 15:41
So does bowtie_hairpins 'consume' the hairpin, and not make it accessible for parallel processes?
Paolo Di Tommaso
@pditommaso
Aug 01 2017 15:42
do you mean hairpinIndex?
Brian Reichholf
@breichholf
Aug 01 2017 15:43
Yes
Paolo Di Tommaso
@pditommaso
Aug 01 2017 15:43
this depends on how hairpinFasta is declared
can you show me the line declaring that channel?
Brian Reichholf
@breichholf
Aug 01 2017 15:45
Sure, it's another process that takes the genome fasta file and GTF as input, and filters out relevant portions.
For brevity:
process extractHairpins {
  tag "genomePrep"

  input:
  file genomeFastaFile
  file genomeAnno

  output:
  file "hairpin.fa" into hairpinFasta

  script:
  """
  grep pre_miR $genomeAnno | \
    sed -e 's/\"//g' | \
    awk -v FS="\t" '# Split to bed' | tr ' ' '\t' > hairpin_plus20nt.bed
    zcat -f ${genomeFastaFile} > genome.fa
    samtools faidx genome.fa
    bedtools getfasta -s -fi genome.fa -bed hairpin_plus20nt.bed > hairpin.fa
  """
}
Paolo Di Tommaso
@pditommaso
Aug 01 2017 15:47
I see, if so it depends on how genomeFastaFile and genomeAnno are created .. :)
(we are going up the graph .. )
Brian Reichholf
@breichholf
Aug 01 2017 15:47
😄
if( params.genomeFasta ){
  genomeFastaFile = Channel.fromPath(params.genomeFasta)
  if( !file(params.genomeFasta).exists() ) exit 1, "Genome Fasta file not found: ${params.genomeFasta}"
}

if (params.genomeAnno) {
  genomeAnno = Channel.fromPath(params.genomeAnno)
  if (!file(params.genomeAnno).exists()) exit 1, "Genome annotation file ${params.genomeAnno} not found. Please download from flybase."
}
Paolo Di Tommaso
@pditommaso
Aug 01 2017 15:48
almost there
both of them are supposed to be a single file, right?
Brian Reichholf
@breichholf
Aug 01 2017 15:49
Yep
Paolo Di Tommaso
@pditommaso
Aug 01 2017 15:49
ok
change it as follows and it will work
Brian Reichholf
@breichholf
Aug 01 2017 15:50
They're defined in an external file, and point to symlinks (not sure if that matters?)
Paolo Di Tommaso
@pditommaso
Aug 01 2017 15:50
if( params.genomeFasta ){
  genomeFastaFile = file(params.genomeFasta)
  if( !genomeFastaFile.exists() ) exit 1, "Genome Fasta file not found: ${params.genomeFasta}"
}

if (params.genomeAnno) {
  genomeAnno = file(params.genomeAnno)
  if (!genomeAnno.exists()) exit 1, "Genome annotation file ${params.genomeAnno} not found. Please download from flybase."
}
though I would do
Brian Reichholf
@breichholf
Aug 01 2017 15:50
So, what's the difference between file() and Channel.fromPath() then?
Paolo Di Tommaso
@pditommaso
Aug 01 2017 15:51
genomeFastaFile = params.genomeFasta ? file(params.genomeFasta) : null
if( !params.genomeFasta || !genomeFastaFile.exists() ) 
   exit 1, "Genome Fasta file not found: ${params.genomeFasta}"
.. same for the genomeAnno
so the explanation is the following
this is the famous difference between singleton channels and queue channels
when you use Channel.fromPath you are creating a queue channel, one that can emit many files
Sergey Venev
@sergpolly
Aug 01 2017 15:54
Thank you for the clarification, Paolo @pditommaso and for the quick bug-fix (multiple-queue stuff)!
I'm wondering now which parameter to change to make nextflow wait for UNKNWN processes longer? Is it pollInterval or exitReadTimeout?
Paolo Di Tommaso
@pditommaso
Aug 01 2017 15:54
@sergpolly wait
hence when you attach such a channel to a process
its outputs will be the same kind of channel
for this reason the channel hairpinFasta (and, downstream of it, hairpinIndex) emits a single value, and that terminates the execution of the process bowtie_hairpins after one task
does that make any sense?
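A minimal sketch of the difference, written in the same style as the snippets in this conversation. The parameter names, paths and the `align` process are hypothetical and only serve to illustrate the two kinds of input.

```groovy
// Hypothetical paths, for illustration only.
params.reads = 'data/*.fq.gz'
params.index = 'ref/hairpin.fa'

// Queue channel: emits each matching file once and is then exhausted.
reads = Channel.fromPath(params.reads)

// Plain file object: used as a process input it behaves like a
// singleton (value) channel, so every task can read it again.
index = file(params.index)

// If `index` were created with Channel.fromPath(params.index) instead,
// the process below would run only once: the single item in that queue
// channel is consumed by the first task, and nothing is left to pair
// with the remaining reads.
process align {
  input:
  file r from reads
  file idx from index

  script:
  """
  echo "aligning ${r} against ${idx}"
  """
}
```

With the `file()` form, one `align` task is expected per file matched by `params.reads`, each receiving the same index.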
@sergpolly hi, you may try to increase exitReadTimeout to 5 mins or more
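For reference, a sketch of how that setting could look in `nextflow.config`. The 5 minute value is just the one suggested above, not a recommendation for every cluster.

```groovy
// nextflow.config
executor {
    // How long to wait for a job's exit status file before treating the job as failed.
    exitReadTimeout = '5 min'
}
```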
Sergey Venev
@sergpolly
Aug 01 2017 16:00
Got it! Will try it. Everything is running great so far, thanks again!
Paolo Di Tommaso
@pditommaso
Aug 01 2017 16:00
enjoy! :)
Brian Reichholf
@breichholf
Aug 01 2017 16:44

does that make any sense?

I see! That explanation helps a lot, yes!