These are chat archives for nextflow-io/nextflow

8th
Apr 2016
Paolo Di Tommaso
@pditommaso
Apr 08 2016 14:05 UTC
@jbyars nextflow keeps all intermediate files in the work directory, so most of the time it's enough to delete this folder.
However, I guess you are thinking of a different strategy. Could you please elaborate a bit more? Thanks
Jason Byars
@jbyars
Apr 08 2016 16:56 UTC
I'm thinking of situations like GATK where the pipeline can generate considerable amounts of intermediate data that's no longer needed after the first couple processes. 2 or 3 processes in when I know I don't need stuff from process 1, it might be nice to eliminate it. When I'm running locally I generally don't care, because my shared space is many TB. When I'm running on AWS, shared usually takes the form of a static size EBS volume. Which means I either guess large or really have to work out how much a job will need.
Paolo Di Tommaso
@pditommaso
Apr 08 2016 16:58 UTC
I see. If this data is NOT declared as an output in any process, it would be easy to implement a cleanup strategy
Jason Byars
@jbyars
Apr 08 2016 16:58 UTC
that's what I was thinking. I was wondering if any sort of mechanism was in place.
Paolo Di Tommaso
@pditommaso
Apr 08 2016 16:58 UTC
or would you like to delete intermediate process outputs?
Jason Byars
@jbyars
Apr 08 2016 16:59 UTC
both actually, it should be simple enough to work out when an output is no longer needed as an input for anything.
the most helpful delete when debugging, though, would be an option to clear everything out of work except the chain of folders relevant to the last run
Paolo Di Tommaso
@pditommaso
Apr 08 2016 17:00 UTC
Yes, you are right. However nextflow keeps intermediate outputs to resume the pipeline execution when needed
if you delete them, resume can't work any more
Jason Byars
@jbyars
Apr 08 2016 17:02 UTC
yes, those should be preserved. I'm talking about the earlier intermediates that are no longer needed, i.e. for process X's outputs to be deleted, all processes that depend on X must have succeeded.
granted you wouldn't be able to add a new dependency for a resume in this case.
do people often add dependencies when doing a debugging resume?
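The rule described above (a process's outputs can go once every process that consumes them has succeeded) can be sketched roughly as follows. This is only an illustration, not how nextflow works internally: the `consumers` and `status` dictionaries, the status strings, and the process names are all hypothetical.

```python
def deletable_outputs(consumers, status):
    """Return the processes whose outputs could be removed.

    consumers: dict mapping a process name to the list of processes
               that read its outputs (hypothetical structure).
    status:    dict mapping a process name to "done", "running" or "pending".
    """
    safe = []
    for producer, readers in consumers.items():
        if status.get(producer) != "done":
            continue  # not finished yet, nothing to delete
        if not readers:
            continue  # final outputs: nothing consumes them, so keep them
        if all(status.get(r) == "done" for r in readers):
            safe.append(producer)
    return sorted(safe)


# Hypothetical GATK-like chain: align -> dedup -> call
consumers = {
    "align": ["dedup"],
    "dedup": ["call"],
    "call": [],
}
status = {"align": "done", "dedup": "done", "call": "running"}
print(deletable_outputs(consumers, status))  # ['align']
```

Here only `align` qualifies: `dedup` has finished, but its consumer `call` is still running, so its outputs have to stay.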
Paolo Di Tommaso
@pditommaso
Apr 08 2016 17:07 UTC
um, when you resume a pipeline, nextflow tries to re-execute all intermediate steps, so if you delete any intermediate process outputs it will launch that process again
however, it could make sense to provide it as an option, with a warning to the user
Jason Byars
@jbyars
Apr 08 2016 17:08 UTC
yes, as you mentioned. It wouldn't be as simple as deleting unneeded intermediates. The knowledge that you no longer need them would need to be preserved between runs.
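One way to picture "preserving that knowledge between runs" is a small ledger file listing the tasks whose outputs were deleted on purpose, which a resume could consult before deciding to re-run anything. The sketch below is purely hypothetical: the file name `released.json`, the task identifiers, and all three function names are assumptions, and nothing like this is confirmed to exist in nextflow.

```python
import json
import os
import tempfile


def load_released(path):
    """Read the set of task IDs whose outputs were deliberately deleted."""
    if not os.path.exists(path):
        return set()
    with open(path) as fh:
        return set(json.load(fh))


def mark_released(path, task_id):
    """Record that a task's outputs were removed on purpose."""
    released = load_released(path)
    released.add(task_id)
    with open(path, "w") as fh:
        json.dump(sorted(released), fh)


def must_rerun(path, task_id, outputs_exist):
    """Re-run only if outputs are missing and were not released on purpose."""
    return not outputs_exist and task_id not in load_released(path)


# Hypothetical usage: "align" was cleaned up deliberately, "dedup" was not.
with tempfile.TemporaryDirectory() as tmp:
    ledger = os.path.join(tmp, "released.json")
    mark_released(ledger, "align")
    print(must_rerun(ledger, "align", outputs_exist=False))  # False
    print(must_rerun(ledger, "dedup", outputs_exist=False))  # True
```

This deliberately ignores the harder case raised in the discussion: if a downstream task itself has to run again, a released task would have to be re-executed first to regenerate its outputs.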
Paolo Di Tommaso
@pditommaso
Apr 08 2016 17:09 UTC
this would be quite simple and useful, I agree
Jason Byars
@jbyars
Apr 08 2016 17:12 UTC
can work/tmp always be deleted between resumes? When I tried it, the resume behavior always appeared rational
Paolo Di Tommaso
@pditommaso
Apr 08 2016 17:14 UTC
if you delete the work path the resume won't work
Jason Byars
@jbyars
Apr 08 2016 17:16 UTC
not work, the tmp folder tree that is created under work.
I was playing around with the scratch option trying to mitigate the size of work/tmp, but without success
Paolo Di Tommaso
@pditommaso
Apr 08 2016 17:20 UTC
ah ok, but yes, that is created only when enabling the scratch option
in that case you can completely delete the tmp scratch folder
I think in that case it would be enough to add the following setting in the config file:
process.afterScript = 'rm -rf {*,.*}'
Jason Byars
@jbyars
Apr 08 2016 17:23 UTC
great, that will help a lot, or I can just kill it in workflow.onComplete. BTW, if I do scratch false the tmp folder should not be created right?
Paolo Di Tommaso
@pditommaso
Apr 08 2016 17:24 UTC
the afterScript would be better because it would delete them during the pipeline execution, not at the end
"if I do scratch false the tmp folder should not be created right?"
yes
Jason Byars
@jbyars
Apr 08 2016 17:26 UTC
excellent, that clarifies several things. Let me go test a bit. Thanks!
Paolo Di Tommaso
@pditommaso
Apr 08 2016 17:26 UTC
ok, great