These are chat archives for nextflow-io/nextflow

27th
Jun 2016
Mike Smoot
@mes5k
Jun 27 2016 15:37
Hi everyone, I'm curious how nextflow decides whether or not to re-run a process when resuming after a crash. I just came in to find a pipeline that had failed over the weekend, and when I resumed it, it started re-running a bunch of processes that had succeeded. Since those processes took a long time to complete the first time, I really don't want to run them again, so I'd like to diagnose what nextflow thinks changed so that I can prevent this in the future. As far as I can tell, neither the nextflow code nor the data have changed. What else does nextflow look for?
Paolo Di Tommaso
@pditommaso
Jun 27 2016 15:42
Hi, nextflow creates a 128-bit hash key from the inputs and the process script. That key defines the task working folder.
Thus, when trying to execute a process, if a folder for that key already exists and contains a zero exit status file and all the expected outputs, the process execution is skipped; otherwise it's just executed (again)
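For illustration, this is roughly what the resume check looks for in a task folder (the hash path and the output file name are hypothetical; the dot-files are the ones nextflow writes for every task):

  work/3f/70d1a9.../
    .command.sh     # the script that was executed
    .command.out    # captured stdout
    .command.err    # captured stderr, ignored by the resume check
    .exitcode       # must exist and contain 0
    chunk_1.fa      # every expected output must be present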
Mike Smoot
@mes5k
Jun 27 2016 15:45
that makes sense. Is there any way that I can print out or see what nextflow is using to make the hash?
Looking at my output, I see a .exitcode of 0 and all of the expected output files. I do see that .command.err has some stuff in it - would that trigger a rebuild?
Paolo Di Tommaso
@pditommaso
Jun 27 2016 15:51
no, .command.err is not taken into consideration
this is the relevant code
you can activate the log using the following command line:
nextflow -trace nextflow.processor.TaskProcessor run ... etc
is your process script accessing some global variable?
Mike Smoot
@mes5k
Jun 27 2016 16:05
No, I'm not accessing any global variables.
Paolo Di Tommaso
@pditommaso
Jun 27 2016 16:07
what shared file system are you using?
Mike Smoot
@mes5k
Jun 27 2016 16:08
I'm reading some input data from nfs, but the work dir is local and I'm running with the local executor.
Paolo Di Tommaso
@pditommaso
Jun 27 2016 16:10
I've noticed that NFS may not provide consistent file timestamps, which can invalidate the cache. But I'm not sure that's your problem.
you may want to try to debug the execution with that flag to understand what is producing a different hash value
Mike Smoot
@mes5k
Jun 27 2016 16:11
Yeah, I'm looking through the trace output now to see if anything jumps out.
Mike Smoot
@mes5k
Jun 27 2016 16:28

Well, I've found the value that's hashing differently. This is the first time:

  73a5ce311a22d83f9cdf7b743bde6695 [java.lang.String] fasta 
  865297d03af1de750eb184a32321b20a [nextflow.util.ArrayBag] [FileHolder(sourceObj:/home/msmoot/code/nextflow_eukaryotic_annotation/work/we03730_nuc_v150325_PM_chr_1__4.fa, storePath:/home/msmoot/code/nextflow_eukaryotic_annotation/work/we03730_nuc_v150325_PM_chr_1__4.fa, stageName:we03730_nuc_v150325_PM_chr_1__4.fa)]

and this is the second time:

  73a5ce311a22d83f9cdf7b743bde6695 [java.lang.String] fasta 
  899140493fd035290b50d4623bd9564a [nextflow.util.ArrayBag] [FileHolder(sourceObj:/home/msmoot/code/nextflow_eukaryotic_annotation/work/we03730_nuc_v150325_PM_chr_1__4.fa, storePath:/home/msmoot/code/nextflow_eukaryotic_annotation/work/we03730_nuc_v150325_PM_chr_1__4.fa, stageName:we03730_nuc_v150325_PM_chr_1__4.fa)]
Not sure what the difference is.
Wait, I figured it out - the fasta file referenced is getting re-written as part of the pipeline, so it has a different timestamp.
Paolo Di Tommaso
@pditommaso
Jun 27 2016 16:32
are you sure that file does not change?
ahhh
different timestamp -> different hash
so it's coherent?
Mike Smoot
@mes5k
Jun 27 2016 16:33
Yes, the hashing is correct, but now I need to figure out a way to not re-write that file each time.
This is happening as part of the splitFasta and subsequent operators.
Paolo Di Tommaso
@pditommaso
Jun 27 2016 16:34
does the content change?
Mike Smoot
@mes5k
Jun 27 2016 16:34
I don't think so, but I'll confirm
Paolo Di Tommaso
@pditommaso
Jun 27 2016 16:35
if it's just a problem of timestamps you can use cache 'deep'
anyway splitFasta should be coherent with the resume
i.e. not creating new files but reusing existing chunks
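For illustration, a minimal sketch of the directive (the process name, input channel and script are hypothetical; by default nextflow hashes an input file's path, size and timestamp, while cache 'deep' hashes its actual content):

process annotateChunk {
    // hash input file content instead of path + size + timestamp
    cache 'deep'

    input:
    file fasta from fasta_files

    output:
    file 'annotation.gff' into annotations

    """
    annotate.sh ${fasta} > annotation.gff
    """
}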
Mike Smoot
@mes5k
Jun 27 2016 16:39
I'm using a custom splitting function (to get one file per record), but in my brief test, the files appear the same. That said, I think cache 'deep' is exactly what I want.
Mike Smoot
@mes5k
Jun 27 2016 16:45
Thanks for your help Paolo!
Paolo Di Tommaso
@pditommaso
Jun 27 2016 16:48
Ah yes, now I remember your use case. Yes, that's the problem. Cache 'deep' should fix it.
Welcome
Mike Smoot
@mes5k
Jun 27 2016 16:48
One more question: the root of my problem is that I'm writing the fasta files as part of a chain of operators, which I gather run for every invocation of nextflow, regardless of whether the underlying data has changed. If I were to put the last step where I actually write the file into a process, then I'd guess this problem would go away, correct?
Paolo Di Tommaso
@pditommaso
Jun 27 2016 16:51
> If I were to put the last step where I actually write the file into a process
What do you mean exactly?
Mike Smoot
@mes5k
Jun 27 2016 16:56

Here's my current chain of operators:

fasta_files = fasta_scaffolds_1
    .splitFasta( record: [id: true, seqString: true] )
    // convert the long seqString into a list of shorter strings
    .map{ rec -> tuple( rec.id, splitSeq(rec.seqString, [], fastaLimit) ) }
    // flatten list of seqStrings into tuples of id__index, and subseq
    .flatMap{ id, seq_list -> seq_list.withIndex().collect{ seq, ind -> tuple(id,"${id}__${ind}",seq) } }
    // write fasta files for each subseq
    .map{ contig_id, seq_id, seq -> tuple( contig_id, seq_id, writeFasta(seq_id, seq) ) }

I'm thinking that if I remove that last map where I call the function writeFasta and put it into a process then the file will be written as part of a process and properly cached.
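For illustration, a minimal sketch of that refactoring, assuming 2016-era syntax (the intermediate channel name is hypothetical; the process reuses the writeFasta name and feeds the same fasta_files channel as above):

// the operator chain now stops before any file is written
seq_tuples = fasta_scaffolds_1
    .splitFasta( record: [id: true, seqString: true] )
    .map{ rec -> tuple( rec.id, splitSeq(rec.seqString, [], fastaLimit) ) }
    .flatMap{ id, seq_list -> seq_list.withIndex().collect{ seq, ind -> tuple(id, "${id}__${ind}", seq) } }

// writing each fasta happens inside a task, so it is cached and skipped on resume
process writeFasta {
    input:
    set contig_id, seq_id, seq from seq_tuples

    output:
    set contig_id, seq_id, file("${seq_id}.fa") into fasta_files

    """
    printf '>%s\\n%s\\n' '${seq_id}' '${seq}' > ${seq_id}.fa
    """
}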

Paolo Di Tommaso
@pditommaso
Jun 27 2016 16:58
I see, you can do that
Actually nextflow does that implicitly as long as you define the input as a file
But if I remember well you have many big chunks, so you may risk saturating the memory
Mike Smoot
@mes5k
Jun 27 2016 17:02
Cool. I'm going to give it a try and see where I get. cache 'deep' should help in any case. Thanks!
Paolo Di Tommaso
@pditommaso
Jun 27 2016 17:03
:+1: