These are chat archives for nextflow-io/nextflow

20th
Dec 2018
Stephen Ficklin
@spficklin
Dec 20 2018 00:06
thanks @rsuchecki. Yeah, it's necessary. We have 26,000 input samples and we hit 500TB storage limits if we don't clean up :-/
Rad Suchecki
@rsuchecki
Dec 20 2018 00:07
Re your actual question (re-typing as my edit disappeared): channels are FIFOs, so in principle in sync, but the order in which e.g. instances of process Y will finish is unknown, so things will get out of sync
Stephen Ficklin
@spficklin
Dec 20 2018 00:08
Okay. I can use a groupTuple. I just wanted to make sure there wasn't a better way.
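Something like this, maybe? (Channel names hypothetical; assumes both processes emit (sample_id, file) tuples.)

```groovy
// merge the two output channels, then group by sample ID; size: 2
// lets each pair emit as soon as X's and Y's files have both arrived
x_out
    .mix( y_out )
    .groupTuple( size: 2 )
    .set { synced_ch }
```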
Rad Suchecki
@rsuchecki
Dec 20 2018 00:09
Stephen Ficklin
@spficklin
Dec 20 2018 00:11
I'll look at that too. Thanks for your help.
Rad Suchecki
@rsuchecki
Dec 20 2018 00:11
:+1:
One more thing - a problem to look out for: operators potentially "waiting" for all inputs before emitting the outputs
Stephen Ficklin
@spficklin
Dec 20 2018 00:15
Yeah, that would be a problem. The 'size' argument for groupTuple() takes care of that, I think. Right?
Rad Suchecki
@rsuchecki
Dec 20 2018 00:15
which may defeat the purpose if cleanup should be continuous
yes
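e.g., assuming a known number of chunks per sample (the 4 is purely illustrative):

```groovy
// with size set, each group is emitted as soon as 4 items with the
// same key have arrived, instead of waiting for upstream to complete
chunks_ch.groupTuple( size: 4 )
```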
Stephen Ficklin
@spficklin
Dec 20 2018 00:17
okay good. I'm really hoping it will work, because right now we have to divide up our data set into 20 pieces and manage each piece independently, which, believe it or not, is a lot to keep up with.
Rad Suchecki
@rsuchecki
Dec 20 2018 00:34
manual management is asking for trouble; good luck! If it doesn't work, perhaps reconsider combining the processes, or deleting in Y
Paolo Di Tommaso
@pditommaso
Dec 20 2018 10:31
@spficklin Use process.scratch = true to delete temporary files (though any file declared as output: will still be there until completion)
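e.g. in nextflow.config:

```groovy
// run each task in a node-local scratch directory; intermediates
// vanish with it, and only files matched by output: declarations
// are copied back to the shared work dir on completion
process.scratch = true
```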
micans
@micans
Dec 20 2018 11:18
@spficklin to solve the 'waiting' problem, use the groupKey function, e.g. as in .map { tag, lines -> tuple( groupKey(tag, lines.size()), lines ) }:
You should always specify the number of expected elements in each tuple using the size attribute, to allow the groupTuple operator to stream the collected values as soon as possible. However, there are use cases in which each tuple has a different size depending on the grouping key. In these cases use the built-in function groupKey, which allows you to create a special grouping key object with which it's possible to associate the group size for a given key.
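a minimal sketch (channel names hypothetical):

```groovy
// attach the expected group size to each key, so groupTuple can
// emit a tag's group as soon as that many items have been collected
sequences_ch
    .map { tag, lines -> tuple( groupKey(tag, lines.size()), lines ) }
    .groupTuple()
    .set { grouped_ch }
```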
Riccardo Giannico
@giannicorik_twitter
Dec 20 2018 14:22

Hi everyone,
I'm running nextflow with Slurm (I have a low priority "test" queue and a high priority "prod" queue).
If I submit nextflow to the 'test' queue like this: sbatch -n 1 -p test nextflow myflow.nf [parameters], it works great: the first nextflow job runs on the 'test' queue together with its child jobs.
If I submit nextflow to the 'prod' queue like this: sbatch -n 1 -p prod nextflow myflow.nf [parameters], the first nextflow job runs on the 'prod' queue but its child jobs run on 'test'.

Is there something I'm missing? How can I run child jobs on the same queue as their parent nextflow job, without hardcoding it into the queue directive of each process in the .nf file?

Paolo Di Tommaso
@pditommaso
Dec 20 2018 15:13
sbatch -n 1 -p prod nextflow myflow.nf -process.queue prod
Riccardo Giannico
@giannicorik_twitter
Dec 20 2018 15:19
uh, cool! Thanks @pditommaso! So -process.queue prod will work for every process in the .nf file?
Paolo Di Tommaso
@pditommaso
Dec 20 2018 15:22
yep
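it's the same as setting it in the config, i.e.:

```groovy
// nextflow.config — applies to every process, unless a process
// overrides it with its own queue directive
process.queue = 'prod'
```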
Riccardo Giannico
@giannicorik_twitter
Dec 20 2018 15:22
great! thanks :)
Paolo Di Tommaso
@pditommaso
Dec 20 2018 15:23
:v:
Stephen Ficklin
@spficklin
Dec 20 2018 17:26
got it! Thanks @micans
@pditommaso the issue with using process.scratch = true is that some of the files I want to delete are needed by other processes, so they do go into an output channel. But once they have been used by the other process I don't need them anymore.
I understand it's not possible to clean those up. So I've hacked together a solution that works. Hopefully we can have functionality in the future that lets us ignore files that are no longer needed, even if they are used as output.
micans
@micans
Dec 20 2018 18:19
I guess it would be somewhat similar to a garbage collection algorithm ... a pretty big task to implement, I would hazard. I thought there was an issue to speed up -resume functionality, but I can't find it right now (this may share some design traits). There is nextflow-io/nextflow#452.
Joe Brown
@brwnj
Dec 20 2018 19:27
In practice -resume is too fragile. Specifying the same directory for -work after updating the config triggers everything to be re-run. Is there a way to recover the correct cache for obviously completed processes?
Paolo Di Tommaso
@pditommaso
Dec 20 2018 19:46
Specifying the same directory for -work after updating the config triggers everything to be re-run
no
Joe Brown
@brwnj
Dec 20 2018 19:47
It's most likely related to the cache not being 'lenient'. There's no way to recover from the existing files in the work dir, correct?
Paolo Di Tommaso
@pditommaso
Dec 20 2018 19:50
the cache key is computed by hashing the input files and the command script; if the inputs change (or your file system reports inconsistent timestamps for them), tasks will be recomputed
Joe Brown
@brwnj
Dec 20 2018 19:51
understood. hopefully changing the cache strategy resolves this on my end
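i.e. something like this in nextflow.config:

```groovy
// 'lenient' builds cache keys from input file path and size only,
// tolerating the inconsistent timestamps shared filesystems can report
process.cache = 'lenient'
```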
Paolo Di Tommaso
@pditommaso
Dec 20 2018 19:52
hope so