These are chat archives for nextflow-io/nextflow

20th Dec 2018
Stephen Ficklin
@spficklin
Dec 20 2018 00:06 UTC
thanks @rsuchecki. Yeah, it's necessary. We have 26,000 input samples and we hit 500TB storage limits if we don't clean up :-/
Rad Suchecki
@rsuchecki
Dec 20 2018 00:07 UTC
Re your actual question (re-typing as my edit disappeared): channels are FIFOs, so in principle in sync, but the order in which e.g. instances of process Y will finish is unknown, so things will get out of sync
Stephen Ficklin
@spficklin
Dec 20 2018 00:08 UTC
Okay. I can use a groupTuple. I just wanted to make sure there wasn't a better way.
Rad Suchecki
@rsuchecki
Dec 20 2018 00:09 UTC
Stephen Ficklin
@spficklin
Dec 20 2018 00:11 UTC
I'll look at that too. Thanks for your help.
Rad Suchecki
@rsuchecki
Dec 20 2018 00:11 UTC
:+1:
One more thing - a problem to look out for: operators potentially "waiting" for all inputs before emitting the outputs
Stephen Ficklin
@spficklin
Dec 20 2018 00:15 UTC
Yeah, that will be a problem. The 'size' argument for groupTuple() takes care of that, I think, right?
Rad Suchecki
@rsuchecki
Dec 20 2018 00:15 UTC
which may defeat the purpose if cleanup should be continuous
yes
Stephen Ficklin
@spficklin
Dec 20 2018 00:17 UTC
okay good. I'm really hoping it will work because right now we have to divide up our data set into 20 pieces and manage each piece independently, which believe it or not is a lot to keep up with.
Rad Suchecki
@rsuchecki
Dec 20 2018 00:34 UTC
manual management is asking for trouble; good luck! If it doesn't work, perhaps re-consider combining the processes or deleting in Y
Paolo Di Tommaso
@pditommaso
Dec 20 2018 10:31 UTC
@spficklin Use process.scratch = true to delete temporary files (though any file declared as output: will still be there until completion)
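For reference, this can also be set globally in nextflow.config rather than per process; a minimal sketch:

```groovy
// nextflow.config sketch: run every task in a node-local scratch directory.
// Intermediate files stay on scratch and are discarded when the task ends;
// only files declared in output: are copied back to the shared work dir.
process {
    scratch = true
}
```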
Stijn van Dongen
@micans
Dec 20 2018 11:18 UTC
@spficklin to solve the 'waiting' problem, use the groupKey function, e.g. as in .map { tag, lines -> tuple( groupKey(tag, lines.size()), lines ) }:
You should always specify the number of expected elements in each tuple using the size attribute to allow the groupTuple operator to stream the collected values as soon as possible. However, there are use cases in which each tuple has a different size depending on the grouping key. In these cases, use the built-in function groupKey, which allows you to create a special grouping key object to which it's possible to associate the group size for a given key.
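Put together, the pattern looks roughly like this (a sketch with made-up keys and a hypothetical sizes map; the point is that each key carries its own expected group size):

```groovy
// Sketch: each grouping key knows how many items it expects, so
// groupTuple() can emit a group as soon as all its members arrive,
// instead of waiting for the whole upstream channel to complete.
def sizes = [chr1: 2, chr2: 3]   // hypothetical: expected items per key

Channel
    .of(['chr1', 'regionA'], ['chr1', 'regionB'],
        ['chr2', 'regionC'], ['chr2', 'regionD'], ['chr2', 'regionE'])
    .map { chr, region -> tuple(groupKey(chr, sizes[chr]), region) }
    .groupTuple()
    .view()
```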
Riccardo Giannico
@giannicorik_twitter
Dec 20 2018 14:22 UTC

Hi everyone,
I'm running nextflow with Slurm (I have a low priority "test" queue and a high priority "prod" queue).
If I submit nextflow to the 'test' queue like this: sbatch -n 1 -p test nextflow myflow.nf [parameters], it works great: the first nextflow job runs on the 'test' queue together with its child jobs.
If I submit nextflow to the 'prod' queue like this: sbatch -n 1 -p prod nextflow myflow.nf [parameters], the first nextflow job runs on the 'prod' queue but its children run on 'test'.

Is there something I'm missing? How can I run child jobs on the same queue as their nextflow parent job without hardcoding it into the queue process directive in the .nf file?

Paolo Di Tommaso
@pditommaso
Dec 20 2018 15:13 UTC
sbatch -n 1 -p prod nextflow myflow.nf -process.queue prod
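The same setting can also live in configuration instead of on the command line; a sketch using profiles (the profile names here are hypothetical):

```groovy
// nextflow.config sketch: pick the Slurm queue per profile,
// so the launch command only needs -profile test or -profile prod.
profiles {
    test { process.queue = 'test' }
    prod { process.queue = 'prod' }
}
```

which would be selected at launch with e.g. sbatch -n 1 -p prod nextflow myflow.nf -profile prod.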
Riccardo Giannico
@giannicorik_twitter
Dec 20 2018 15:19 UTC
uh cool ! thanks @pditommaso ! So -process.queue prod will work for every process in the .nf file?
Paolo Di Tommaso
@pditommaso
Dec 20 2018 15:22 UTC
yep
Riccardo Giannico
@giannicorik_twitter
Dec 20 2018 15:22 UTC
great! thanks :)
Paolo Di Tommaso
@pditommaso
Dec 20 2018 15:23 UTC
:v:
Stephen Ficklin
@spficklin
Dec 20 2018 17:26 UTC
got it! Thanks @micans
@pditommaso the issue with using the process.scratch = true is that some of the files I want to delete are needed by other processes, so they do go into an output queue. But once they are used by the other process then I don't need them anymore.
I understand it's not possible to clean those up. So I've hacked together a solution that works. Hopefully we can have functionality in the future that lets us ignore files that are no longer needed, even if they are used as output.
Stijn van Dongen
@micans
Dec 20 2018 18:19 UTC
I guess it would be somewhat similar to a garbage collection algorithm .. a pretty big task to implement I would hazard. I thought there was an issue to speed up -resume functionality, but I can't find it right now (this may share some design traits). There is nextflow-io/nextflow#452.
Joe Brown
@brwnj
Dec 20 2018 19:27 UTC
In practice -resume is too fragile. Specifying the same directory for -work after updating the config triggers everything to be re-run. Is there a way to recover the correct cache for obviously completed processes?
Paolo Di Tommaso
@pditommaso
Dec 20 2018 19:46 UTC
Specifying the same directory for -work after updating the config triggers everything to be re-run
no
Joe Brown
@brwnj
Dec 20 2018 19:47 UTC
It's most likely related to the cache not being 'lenient'. There's no way to recover from the existing files in the work dir, correct?
Paolo Di Tommaso
@pditommaso
Dec 20 2018 19:50 UTC
the cache is computed hashing input files and the command script, if inputs change (or your file system reports inconsistent timestamps for it) tasks will be recomputed
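For the timestamp issue mentioned here, the lenient cache strategy is the usual workaround; a minimal config sketch:

```groovy
// nextflow.config sketch: 'lenient' hashing considers file path and size
// only, ignoring timestamps that shared file systems may report
// inconsistently, which otherwise invalidates the -resume cache.
process {
    cache = 'lenient'
}
```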
Joe Brown
@brwnj
Dec 20 2018 19:51 UTC
understood. hopefully changing the cache strategy resolves this on my end
Paolo Di Tommaso
@pditommaso
Dec 20 2018 19:52 UTC
hope so