These are chat archives for nextflow-io/nextflow

17th Jul 2017
Shellfishgene
@Shellfishgene
Jul 17 2017 12:45
I don't understand why I get No such variable: pair_id here:
#!/usr/bin/env nextflow

params.reads = "$baseDir/raw_data_name_mod/*_R{1,2}.fastq.gz"
params.index = "$baseDir/pipefish"

Channel
    .fromFilePairs( params.reads )
    .ifEmpty { error "Cannot find any reads matching: ${params.reads}" }
    .set { fastq_ch } 

Channel
    .value( params.index )
    .set { index_prefix } 

process rsem {
   publishDir "rsem_out_nf", mode: 'symlink'

   input:
   set val(index) from index_prefix
   set pair_id, file(reads) from fastq_ch

   output:
   set pair_id, file '$id.genes.results' into rsem_results

   """
   rsem-calculate-expression -p ${task.cpus} --paired-end --bowtie2 --estimate-rspd --append-names ${reads} ${index} ${pair_id}
   """
}
Evan Floden
@evanfloden
Jul 17 2017 12:46
in output, did you try file with ()?
Shellfishgene
@Shellfishgene
Jul 17 2017 12:49
@skptic That was it, still not sure why that results in that error though...
Evan Floden
@evanfloden
Jul 17 2017 12:50
np. Glad it helped
Shellfishgene
@Shellfishgene
Jul 17 2017 12:53
Now I have a warning, WARN: Input 'set' must define at least two component -- Check process 'rsem'. Does this mean the pair_id channel does not contain two files?
Paolo Di Tommaso
@pditommaso
Jul 17 2017 12:53
the warn is referring to set val(index) from index_prefix
Evan Floden
@evanfloden
Jul 17 2017 12:54
I think you do not need set in that input.
the val index is a single item.
Hence, no need for set.
Shellfishgene
@Shellfishgene
Jul 17 2017 12:56
Ok, so I just remove the set?
Evan Floden
@evanfloden
Jul 17 2017 12:56
:+1:
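So the corrected input block would read like this (a sketch, reusing the names from the script above):

input:
   val index from index_prefix
   set pair_id, file(reads) from fastq_ch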
Shellfishgene
@Shellfishgene
Jul 17 2017 13:13
My output line in the above is now set pair_id, file('*.genes.results') into rsem_results. How would I do this if I want to specify the output file as $pair_id.genes.results?
Evan Floden
@evanfloden
Jul 17 2017 13:14
try: set pair_id, file("${pair_id}.genes.results") into rsem_results
Shellfishgene
@Shellfishgene
Jul 17 2017 13:18
Ah, that's where the {} come in again. Thanks.
Shellfishgene
@Shellfishgene
Jul 17 2017 13:56
Sorry, another basic question: How do I get a space-separated string for the command from a collected channel?
Paolo Di Tommaso
@pditommaso
Jul 17 2017 13:57
what do u mean exactly?
Shellfishgene
@Shellfishgene
Jul 17 2017 13:59
The RSEM process outputs a bunch of *.genes files, and the next process should use them all together as the input, rsem-generate-data-matrix 01.genes 02.genes 03.genes ... > output_matrix.txt
I thought that would work using the collect operator on the output channel, but I'm not sure how and could not find an example.
Paolo Di Tommaso
@pditommaso
Jul 17 2017 14:01
as long as you capture all of them in a single output
Shellfishgene
@Shellfishgene
Jul 17 2017 14:02
Just found the split and gather example, I'll have a look
Paolo Di Tommaso
@pditommaso
Jul 17 2017 14:02
you can specify altogether in a single input
Shellfishgene
@Shellfishgene
Jul 17 2017 14:04
input:
   file '*.genes.results' from rsem_results

output:
   file 'tpm_matrix.txt'

   """
   rsem-generate-data-matrix-tpm *.genes.results > tpm_matrix.txt
   """
Like this?
Paolo Di Tommaso
@pditommaso
Jul 17 2017 14:05
use just * to keep the original names
even better
input:
   file genes from rsem_results
output:
   file 'tpm_matrix.txt'

   """
   rsem-generate-data-matrix-tpm $genes > tpm_matrix.txt
   """
Shellfishgene
@Shellfishgene
Jul 17 2017 14:06
Hmm, so why would nf know to wait for all files, and not do this one by one?
Paolo Di Tommaso
@pditommaso
Jul 17 2017 14:08
how is the upstream task declared?
Shellfishgene
@Shellfishgene
Jul 17 2017 14:08
output:
   set pair_id, file('*.genes.results') into rsem_results
Paolo Di Tommaso
@pditommaso
Jul 17 2017 14:08
ok
however, don't forget the output/input structure must match, hence
input:
   file pair_id, file(genes) from rsem_results
(sorry before I forgot file)
does that make sense?
Shellfishgene
@Shellfishgene
Jul 17 2017 14:10
No ;)
Paolo Di Tommaso
@pditommaso
Jul 17 2017 14:10
what are you missing? :)
Shellfishgene
@Shellfishgene
Jul 17 2017 14:10
Would that not run rsem-generate-data-matrix-tpm once for each sample?
Paolo Di Tommaso
@pditommaso
Jul 17 2017 14:11
nope, because all output files are captured in the same emission
thus it will trigger one task with all those files
Shellfishgene
@Shellfishgene
Jul 17 2017 14:12
Hmm, I'll have to look at this some more...
In your example above, is file pair_id correct, or did you mean set pair_id?
Paolo Di Tommaso
@pditommaso
Jul 17 2017 14:14
yes sorry
input:
   set pair_id, file(genes) from rsem_results
I'm doing too many things at once...
Shellfishgene
@Shellfishgene
Jul 17 2017 14:14
Just so I understand, how would you change that line to make nf process one file at a time?
Paolo Di Tommaso
@pditommaso
Jul 17 2017 14:15
do you need the pair_id?
Shellfishgene
@Shellfishgene
Jul 17 2017 14:16
I just don't understand why set pair_id, file(genes) from rsem_results collects all files from that channel and runs them in one command. It's exactly how I'd write the input line if I wanted to process one sample at a time?
Paolo Di Tommaso
@pditommaso
Jul 17 2017 14:17
it does not collect
they are already collected because the output
output:
   set pair_id, file('*.genes.results') into rsem_results
sends out all of them together
is this clear?
Shellfishgene
@Shellfishgene
Jul 17 2017 14:18
Ah, I'm sorry, that's my mistake. There is only one file that matches that. I did not correct it yet; it should be set pair_id, file("${pair_id}.genes.results") into rsem_results
And that outputs one sample at a time.
Paolo Di Tommaso
@pditommaso
Jul 17 2017 14:20
does it solve your problem?
Shellfishgene
@Shellfishgene
Jul 17 2017 14:21
I'll try with * and see if it does what I want. Sorry for taking your time...
Paolo Di Tommaso
@pditommaso
Jul 17 2017 14:30
I think it won't work as you are expecting
You are trying to implement a scatter pattern, one to many tasks, right?
Shellfishgene
@Shellfishgene
Jul 17 2017 14:44
No, it's really simple: I have 10 samples, and the first process is mapping and counting, resulting in 10 sampleX.genes.results files. The next process takes all 10 files and makes one big table out of them. That is rsem-generate-data-matrix-tpm *genes.results > tpm_matrix.txt.
Paolo Di Tommaso
@pditommaso
Jul 17 2017 14:58
so it's the opposite: you have a mapping task for each sample and then you want to collect all of them
Shellfishgene
@Shellfishgene
Jul 17 2017 14:59
yes
Paolo Di Tommaso
@pditommaso
Jul 17 2017 14:59
ok
so you only need to use collect to gather all of them
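For reference, a minimal sketch of that gather step, assuming the upstream process emits set pair_id, file("${pair_id}.genes.results") into rsem_results as discussed above (the process name make_matrix is illustrative):

process make_matrix {
    input:
    file genes from rsem_results.map { id, f -> f }.collect()

    output:
    file 'tpm_matrix.txt'

    """
    rsem-generate-data-matrix-tpm $genes > tpm_matrix.txt
    """
}

The map strips the pair_id from each tuple, collect gathers all the files into a single emission, and $genes then expands to the space-separated file list in the command - which answers the original question above.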
Ido Tamir
@idot
Jul 17 2017 15:01
Hi, is it possible to declare a global publishDir?
Paolo Di Tommaso
@pditommaso
Jul 17 2017 15:02
yes, in the config file
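For example, something like this in nextflow.config would apply the directive to every process (the path and mode values here are just illustrative):

process {
    publishDir = [path: 'results', mode: 'symlink']
}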
Sergey Venev
@sergpolly
Jul 17 2017 15:08
Hi, I have a couple of questions here:
1) is it possible to use dynamic directives and dynamic computing resources in the config file? What would be the syntax?
2) The second question stems from an example: let's say I have a channel of ~100 chunks that I want to merge into one. In a cluster environment it's beneficial to do it hierarchically, i.e., merging in groups of 4 would result in ~25 premerged chunks - now 25 is still too many and I'd like to run the same merging process again. Here is the question: is there a way in nextflow to use a process recursively, or to organize some kind of for loop to cycle my channel of chunks through, until there is only 1 merged chunk left?
Paolo Di Tommaso
@pditommaso
Jul 17 2017 15:09
question 1) yes, you need to wrap the directive in curly brackets, e.g. process.something = { etc }
Sergey Venev
@sergpolly
Jul 17 2017 15:13
time = { $task.attempt<2 ? '2h' : '4h' }
}
Paolo Di Tommaso
@pditommaso
Jul 17 2017 15:13
question 2) are you referring to external sorting? in that case it's done automatically by NF when using collectFile
yes (obviously w/o the extra })
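So the working config entry would look something like this (note it's task.attempt inside the closure, without the $; the errorStrategy and maxRetries lines are added here for illustration, since task.attempt only goes above 1 when a task is retried):

process {
    errorStrategy = 'retry'
    maxRetries = 2
    time = { task.attempt < 2 ? '2h' : '4h' }
}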
Sergey Venev
@sergpolly
Jul 17 2017 15:15
Yes, sorry it was a typo - I'll try that! Thanks! I was wrapping that into quotes earlier and it wasn't working
I'll read about collectFile, but I doubt it would help: we're doing essentially a merge --sort (yes it does look like external sorting is the term for it) for a bunch of zipped files, so
we're unzipping chunks and zipping bigger chunks back
Paolo Di Tommaso
@pditommaso
Jul 17 2017 15:19
I see collectFile cannot work for zipped files
Sergey Venev
@sergpolly
Jul 17 2017 15:19
Moreover, I want each instance of the pre-merge process to run on a different node
I think unzipping/zipping is our biggest bottleneck here - it requires lots of CPU
that's why when it's done on a single node for all ~100 chunks - it struggles
Paolo Di Tommaso
@pditommaso
Jul 17 2017 15:21
well, rather than nodes I would think in terms of cpus
Sergey Venev
@sergpolly
Jul 17 2017 15:21
I was trying to do recursion in nextflow - but it didn't work: trying to redirect a channel back into the same process using matching names
Paolo Di Tommaso
@pditommaso
Jul 17 2017 15:22
could a parallel gzip help? https://zlib.net/pigz/
Sergey Venev
@sergpolly
Jul 17 2017 15:22
we're using pbgzip already
Paolo Di Tommaso
@pditommaso
Jul 17 2017 15:22
I see
Sergey Venev
@sergpolly
Jul 17 2017 15:23
it scales for compression and does not scale for decompression
Paolo Di Tommaso
@pditommaso
Jul 17 2017 15:23
what about having a step doing the premerge and a second step for the final one?
Sergey Venev
@sergpolly
Jul 17 2017 15:23
that's the way I have it now
Paolo Di Tommaso
@pditommaso
Jul 17 2017 15:23
what's bad with that?
Sergey Venev
@sergpolly
Jul 17 2017 15:23
I was just wondering if I could do it "beautifully" with recursion or something in a dynamic fashion
sometimes I'd do 1 premerge, other times 2
I can have these premerges there and execute them
Paolo Di Tommaso
@pditommaso
Jul 17 2017 15:24
no recursion sorry
Sergey Venev
@sergpolly
Jul 17 2017 15:24
conditionally, based on the size of the channel
But of course I cannot have more premerges than are hardcoded in the pipeline code
I see about recursion. That's understandable, I guess.
Will keep multiple hardcoded steps then, and maybe control their number from the config file. The user would control it, I guess.
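A rough sketch of one hardcoded premerge round plus the final merge (chunks_ch and merge_and_sort are hypothetical placeholders for the real channel and merge command):

process premerge {
    input:
    file chunks from chunks_ch.collate(4)

    output:
    file 'premerged_*.gz' into premerged_ch

    """
    merge_and_sort $chunks > premerged_${task.index}.gz
    """
}

process merge_all {
    input:
    file parts from premerged_ch.collect()

    output:
    file 'merged.gz'

    """
    merge_and_sort $parts > merged.gz
    """
}

collate(4) groups the channel into lists of 4, so each premerge task gets 4 chunks, and task.index keeps the intermediate names unique before collect gathers everything for the final task.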
Shawn Rynearson
@srynobio
Jul 17 2017 18:49
Question about restartability. I noticed and read that Nextflow has a resume option, but based on what I've reviewed, it's really only for resuming based on code changes. So does Nextflow have a method to resume based on data? E.g. I'm processing 100 bam files, but an error occurs at bam 90. If I restart, will Nextflow just continue from the 90th bam, or are all files reprocessed? How does this change for each step in the pipeline (including the last step)? Thanks
Paolo Di Tommaso
@pditommaso
Jul 17 2017 18:52
yes, it will skip all processes already successfully executed
Shawn Rynearson
@srynobio
Jul 17 2017 18:56
the resume option is not needed then
Paolo Di Tommaso
@pditommaso
Jul 17 2017 18:57
yes, it's needed. The resuming mechanism is only activated when specifying the -resume option
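e.g. something like:

nextflow run main.nf -resume

re-runs the pipeline and picks up the cached result of any task whose inputs and script are unchanged.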
Shawn Rynearson
@srynobio
Jul 17 2017 18:58
great, I'll test some more. Thanks!
Paolo Di Tommaso
@pditommaso
Jul 17 2017 18:58
:+1: