These are chat archives for nextflow-io/nextflow

29th Jan 2018
Luca Cozzuto
@lucacozzuto
Jan 29 2018 09:26
hi all, I'm wondering if there is a way to publish only some output files from a process and not others... i.e. I have a process that outputs 4 files and I need all of them for other processes but I want to publish in my output folder only 2 of them.
Paolo Di Tommaso
@pditommaso
Jan 29 2018 09:33
yes, either using a glob pattern or a saveAs closure returning null for the ones you want to strip away
Luca Cozzuto
@lucacozzuto
Jan 29 2018 09:34
can you show me an example? :)
(And it would be nice to have an example of this in the official documentation)
Paolo Di Tommaso
@pditommaso
Jan 29 2018 09:35
process foo {
  publishDir '/some/path', pattern: '*.fq'
}
Luca Cozzuto
@lucacozzuto
Jan 29 2018 09:38
and with two extensions?
Paolo Di Tommaso
@pditommaso
Jan 29 2018 09:40
what a lazy guy!
process foo {
  publishDir '/some/path', pattern: '*.{fq,txt}'
}
:)
or whatever, it follows Bash glob syntax (almost)
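For reference, the saveAs variant mentioned above might look like this (a sketch with hypothetical file names; returning null from the closure skips publishing that file):

process foo {
  publishDir '/some/path', saveAs: { name -> name.endsWith('.fq') ? name : null }

  output:
  file '*' into out_ch

  '''
  touch sample.fq sample.txt
  '''
}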
Luca Cozzuto
@lucacozzuto
Jan 29 2018 09:41
thanks! I'm not lazy, I simply don't know :)
Paolo Di Tommaso
@pditommaso
Jan 29 2018 09:42
:sweat_smile:
Phil Ewels
@ewels
Jan 29 2018 14:56
Hi @pditommaso - wondering if you can help with a head scratcher for me..
Paolo Di Tommaso
@pditommaso
Jan 29 2018 14:56
sure
Phil Ewels
@ewels
Jan 29 2018 14:56
We have an old friend of a bug in our pipeline - a conditional process that prevents a later process (MultiQC) from running if it doesn't execute
I tried switching that into a big if / else which has a fake closed channel in the else section, which usually does the job
however, num_bams is a variable (see above code) and so the if block just executes immediately, whilst it's still null and the process never runs
this doesn't happen when using the when approach as the process has input channels that it waits for before checking the when
Paolo Di Tommaso
@pditommaso
Jan 29 2018 14:58
sure, you cannot do that!
Phil Ewels
@ewels
Jan 29 2018 14:58
So - can you suggest a better way to (a) make MultiQC run when this process is skipped, so that the sample_correlation_results output channel is still generated
or (b) make the if block wait until it's ready to run
Paolo Di Tommaso
@pditommaso
Jan 29 2018 14:58
a)
add an input as
input:
val num_bams from bam_count.count()
that's all
Phil Ewels
@ewels
Jan 29 2018 14:59
and then put the logic into the script?
so it always runs, but it just doesn't do anything in the script if there aren't enough bam files?
Paolo Di Tommaso
@pditommaso
Jan 29 2018 15:00
the when should remain as it is
what's wrong with that?
Phil Ewels
@ewels
Jan 29 2018 15:01
if when returns false, then the process doesn't run and the sample_correlation_results output channel isn't created
this is an input for the later multiqc process
so then MultiQC doesn't run
So with my if/else approach I was generating sample_correlation_results = Channel.empty().close() if the process doesn't run, so that the MultiQC process is still triggered
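A sketch of that if/else workaround (the process body and the threshold are hypothetical placeholders):

if( num_bams > 2 ) {
    process sample_correlation {
        output:
        file 'correlation.csv' into sample_correlation_results

        '''
        touch correlation.csv
        '''
    }
}
else {
    // a closed empty channel, so downstream operators like toList() still emit
    sample_correlation_results = Channel.empty().close()
}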
Paolo Di Tommaso
@pditommaso
Jan 29 2018 15:02
umm .. let me see
Phil Ewels
@ewels
Jan 29 2018 15:04
this is a pretty common scenario for us by the way - part of the problem with having MultiQC at the end there just sucking up anything and everything
Paolo Di Tommaso
@pditommaso
Jan 29 2018 15:05
my point is that the when should work and you should not have any problem with the channel creation
just tried this
process foo {
  input:
  val x from Channel.from(1,2,3).count()
  output:
  val 'none' into out_ch
  when:
  x == 1

  '''
  echo true
  '''
}

out_ch.println()
ahhh, but since sample_correlation_results is empty it won't trigger the multiqc step, right?
Paolo Di Tommaso
@pditommaso
Jan 29 2018 15:13
I think if you use sample_correlation_results.toList() instead of sample_correlation_results.collect() here it should work
the difference is that toList returns an empty list, therefore it triggers the process execution even when no outputs are available
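A minimal sketch of the difference on an empty channel:

// toList emits a single (empty) list even when the source channel is empty
Channel.empty().toList().println()   // prints []

// collect emits nothing for an empty source, so a downstream process never triggers
Channel.empty().collect().println()  // prints nothing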
Phil Ewels
@ewels
Jan 29 2018 15:56
Sorry, got sucked into a meeting
Ok brilliant, thanks! I think we used to use toList() originally but switched it to .collect() when that was introduced :laughing:
Paolo Di Tommaso
@pditommaso
Jan 29 2018 15:58
basically that's the only difference
Phil Ewels
@ewels
Jan 29 2018 16:02
So out of curiosity, when is it better to use .collect()?
Shawn Rynearson
@srynobio
Jan 29 2018 16:03
I was just wondering if there is a way to add more control to the cache directive? I've noticed that if a pipeline needs restarting, many previously completed steps will restart, even if no code changes have been made.
Paolo Di Tommaso
@pditommaso
Jan 29 2018 16:03
because in most cases if there's no output you won't execute that process
Phil Ewels
@ewels
Jan 29 2018 16:03
Gotcha - so my bug was in most cases the desired behaviour basically :smile: :+1:
Paolo Di Tommaso
@pditommaso
Jan 29 2018 16:03
:+1:
I was just wondering if there is a way to add more control to the cache directive? I've noticed that if a pipeline needs restarting, many previously completed steps will restart, even if no code changes have been made.
Shawn Rynearson
@srynobio
Jan 29 2018 16:04
@pditommaso is that question clear?
Paolo Di Tommaso
@pditommaso
Jan 29 2018 16:05
if this happens there may be some non-determinism in your pipeline, or some inputs are wrongly declared
Shawn Rynearson
@srynobio
Jan 29 2018 16:13
Here is a DAG of my current pipeline. Say for instance I have to rerun my process called bam2fastq for one file that needed restarting - is it expected that I will need to re-run all of my bam2fastq processes (for all inputs) and fastqc again (for all), even though they completed the first time through, minus the one that failed?
Paolo Di Tommaso
@pditommaso
Jan 29 2018 16:15
if you provide the same data and you specify the -resume option all successfully completed tasks are not expected to be executed again
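For reference, assuming a pipeline script named main.nf, a resumed run looks like:

nextflow run main.nf -resume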
Phil Ewels
@ewels
Jan 29 2018 16:15
:+1: They don't in our pipelines :) (we do this a lot)
Shawn Rynearson
@srynobio
Jan 29 2018 16:16
That's why I'm asking, I'm noticing that many of the files will restart.
Paolo Di Tommaso
@pditommaso
Jan 29 2018 16:17
identify the first task that is executed but should not be; the problem should be there
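One way to find it (an assumption on my part, not something suggested in this chat) is to launch the run with -dump-hashes and compare the logged task hashes between runs; the input whose hash changes is the one invalidating the cache:

nextflow run main.nf -resume -dump-hashes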
Shawn Rynearson
@srynobio
Jan 29 2018 16:21
Currently I'm processing a small project (~50 samples), and as I stated above, one file died on my bam2fastq step. I noticed that relaunching with the -resume flag triggered all 50 samples to re-process this step, as well as all 50 fastqc steps.
I guess I'm unsure if I'm not setting something up correctly, or if there's something I can add to the process, etc. to control this.
Paolo Di Tommaso
@pditommaso
Jan 29 2018 16:23
is the code shared somewhere ?
Shawn Rynearson
@srynobio
Jan 29 2018 16:24
Not currently, but I can send my nf script and config file.
Paolo Di Tommaso
@pditommaso
Jan 29 2018 16:24
can you copy and paste the process definition here ?
Shawn Rynearson
@srynobio
Jan 29 2018 16:26
@pditommaso try this
Paolo Di Tommaso
@pditommaso
Jan 29 2018 16:40
the problem is this
 input:
    // thanks to @mes5K for the tokenize hint!
    set val(sample_id), val(primaryBam) from makeFQ_ch.flatMap { it.tokenize('\n') }.map{ it.tokenize(',') }
files must be properly declared as file
Shawn Rynearson
@srynobio
Jan 29 2018 16:41

so more like:

input:
set val(sample_id), file(primaryBam) from makeFQ_ch.flatMap { it.tokenize('\n') }.map{ it.tokenize(',') }

Paolo Di Tommaso
@pditommaso
Jan 29 2018 16:41
better but not enough
also the received values must be file objects
therefore
Shawn Rynearson
@srynobio
Jan 29 2018 16:42
They need to be files, not strings
Paolo Di Tommaso
@pditommaso
Jan 29 2018 16:43
makeFQ_ch.flatMap { it.tokenize('\n') }.map{ def cols = it.tokenize(','); return [cols[0], file(cols[1])] }
with that it should work, however I would refactor that code as
Shawn Rynearson
@srynobio
Jan 29 2018 16:44
so the function file(cols[1]) will tell nextflow to treat this value as a file.
Paolo Di Tommaso
@pditommaso
Jan 29 2018 16:44
input:
    // thanks to @mes5K for the tokenize hint!
    set val(sample_id), file(primaryBam) from makeFQ_ch.splitCsv().map { id, path -> tuple(id, file(path)) }
a bit more readable
so the function file(cols[1]) will tell nextflow to treat this value as a file.
yes
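A self-contained sketch of that splitCsv pattern (samples.csv is a hypothetical file with rows like sampleA,/path/to/sampleA.bam):

// parse each CSV row into a (sample_id, file) tuple
Channel
    .fromPath('samples.csv')
    .splitCsv()
    .map { id, path -> tuple(id, file(path)) }
    .println()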
Shawn Rynearson
@srynobio
Jan 29 2018 16:45
Okay, as I noted before, this is where it looks like some groovy magic is added.
Paolo Di Tommaso
@pditommaso
Jan 29 2018 16:46
it's not magic :)
Shawn Rynearson
@srynobio
Jan 29 2018 16:46
:)
I'll test this out and review other places in the code that need this change. I think overall it makes sense; I need to be more explicit and define what is a file.
As always, thanks @pditommaso !
Paolo Di Tommaso
@pditommaso
Jan 29 2018 16:48
you are welcome!
Tim Diels
@timdiels
Jan 29 2018 17:56
Not sure if you remember the question about enabling caching on an existing workflow (not yet written in Nextflow) which uses a database for nearly all input/output, but I did the math. Splitting it into multiple databases and copying them when modified would increase the disk space used by databases from 66GB to 96GB, with the very last step (merging all DBs into one) increasing it to a total of 162GB. With the final output itself being 66GB, that's 96GB eaten up by intermediate databases. I think that's an alright price to pay for having proper caching?
Paolo Di Tommaso
@pditommaso
Jan 29 2018 18:41
barely, I've a very bad memory :/