These are chat archives for nextflow-io/nextflow

14th
Nov 2017
Simone Baffelli
@baffelli
Nov 14 2017 12:00
Good "morning". A philosophical question today: is it a desired behavior that a process cache is invalidated by the change of a certain input, it that input is not used in the process definition, that is in the script or exec directive?
Because in many cases I decided to pass extra information through some processes by adding more values to a set input, without changing the way the process is supposed to run. However, with the current behavior a "perfectly fine" process is re-run instead of being taken from the cache, since the change in the input set changes the process caching key.
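A minimal sketch of the behavior (process, channel, and values are hypothetical): the meta value below never appears in the script, yet changing it upstream alters the task hash and defeats -resume:

in_ch = Channel.from( ['sampleA', 'run-42'], ['sampleB', 'run-42'] )

process foo {
    input:
    set val(id), val(meta) from in_ch  // meta is carried along but never used

    """
    echo ${id} > out.txt   # editing meta still invalidates this task's cache
    """
}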
Paolo Di Tommaso
@pditommaso
Nov 14 2017 18:12
you are doing a PhD (Doctor of Philosophy), aren't you? :grin:
BTW, NF creates a unique key by hashing all inputs and the command script, hence whenever an input changes the key changes, even if that input is not used. In a perfect (philosopher's) world it should not happen (maybe), because the idempotency of the task would be guaranteed, but in the real world this would be too costly to implement, if possible at all
there are a lot of bioinformatics tools which assume a certain file (extension) must exist as long as another is passed on the command line
Paolo Di Tommaso
@pditommaso
Nov 14 2017 18:18
for these tools there's no way to infer the complete list of files used by parsing the command string
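A concrete illustration (process and channel names hypothetical; the samtools behavior is real): a region query reads sample.bam.bai from the directory where sample.bam sits, so the index must be staged as an input even though it never appears in the command:

process count_region {
    input:
    file bam from bam_ch
    file bai from bai_ch  // never referenced below, but samtools expects it next to the BAM

    """
    samtools view -c ${bam} chr1
    """
}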
Venkat Malladi
@vsmalladi
Nov 14 2017 19:34
anyone have a good example of combining a list of tsv files into one channel?
Paolo Di Tommaso
@pditommaso
Nov 14 2017 19:35
Channel.fromPath('/path/*.tsv')
       .splitCsv(sep: '\t')
       .set { one_ch }
?
Venkat Malladi
@vsmalladi
Nov 14 2017 19:36
@pditommaso so this example will give one channel per row?
Paolo Di Tommaso
@pditommaso
Nov 14 2017 19:37
one channel emitting each row one by one
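For example (channel contents hypothetical), a two-row TSV would be emitted as two separate items, each a list of column values:

Channel.fromPath('/path/*.tsv')
       .splitCsv(sep: '\t')
       .subscribe { row -> println row }  // prints e.g. [sample1, 10] then [sample2, 20]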
Venkat Malladi
@vsmalladi
Nov 14 2017 19:39
okay thanks
Paolo Di Tommaso
@pditommaso
Nov 14 2017 19:39
:+1:
Eric Davis
@davisem
Nov 14 2017 21:21
I'm trying to explore the upper bounds of what is possible with Nextflow. How big and how complicated a pipeline has been successfully maintained with this tool, preferably in an HPC environment? Would anyone care to share stories? Would this post be better for the blog?
Paolo Di Tommaso
@pditommaso
Nov 14 2017 21:23
in terms of workflow complexity, this is one of the biggest ones I know: https://twitter.com/nextflowio/status/729700356800819200
Eric Davis
@davisem
Nov 14 2017 21:25
impressive
Paolo Di Tommaso
@pditommaso
Nov 14 2017 21:25
in terms of tasks, I've run workflows with 1.5 million tasks without any problem
Mike Smoot
@mes5k
Nov 14 2017 21:25
I took some notes on this a while back and will see if I'm able to share them (shakes fist at lawyers)
Eric Davis
@davisem
Nov 14 2017 21:26
nice thank you!
Paolo Di Tommaso
@pditommaso
Nov 14 2017 21:26
I was going to mention you, Mike
Eric Davis
@davisem
Nov 14 2017 21:26
What I would need to run would be on the order of that graph, maybe larger.
Bewildering complexity
Of the workflow managers, Nextflow seems to have more power, whereas the "other guys" workflow tool (won't mention names) quickly reaches its limits.
Paolo Di Tommaso
@pditommaso
Nov 14 2017 21:28
IARC maintains a list of NF pipelines they have developed
Eric Davis
@davisem
Nov 14 2017 21:28
So this is encouraging to see
Mike Smoot
@mes5k
Nov 14 2017 21:33
Speaking of performance, is there a way to limit the amount of memory Nextflow can consume? An -XmxMaxSomething JVM argument?
I've been getting OutOfMemoryErrors in a pipeline
Paolo Di Tommaso
@pditommaso
Nov 14 2017 21:35
the usual JVM flags
Mike Smoot
@mes5k
Nov 14 2017 21:35
And there's a nextflow ENV variable I can add this to, right?
Paolo Di Tommaso
@pditommaso
Nov 14 2017 21:36
though an OutOfMemoryError is likely caused by improper operator/data use
NXF_OPTS or _JAVA_OPTIONS
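for instance (heap sizes illustrative), to cap the JVM heap when launching a run:

NXF_OPTS='-Xms512m -Xmx4g' nextflow run main.nf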
Mike Smoot
@mes5k
Nov 14 2017 21:37
Yeah, we're doing a LOT with operators and we have a LOT of data. My plan is to move this work to a process.
perfect, thanks
Paolo Di Tommaso
@pditommaso
Nov 14 2017 21:39
in this case the best thing to do is to dump the JVM heap and analyse the content to troubleshoot the problem
Mike Smoot
@mes5k
Nov 14 2017 21:44
I have a pretty good idea where the problem is. I've just been handed a larger dataset than I've seen before, and what used to work without problem is now failing. It takes at least a day to exhaust the resources on the machine in question. One issue with the OutOfMemoryError is that it doesn't get fully reported by Nextflow: workflow.success is correctly set to false and I see the error in the logs, but none of the other workflow error output gets populated. I'm not sure if it would be possible to report errors like exceptions get reported.
Paolo Di Tommaso
@pditommaso
Nov 14 2017 21:47
well, an OOME should cause the JVM to stop abruptly, not sure it can be managed
however this sounds like a memory leak
Mike Smoot
@mes5k
Nov 14 2017 21:48
yep, that's what I figured
I'll see if I can run some of the diagnostic tools to be sure
Paolo Di Tommaso
@pditommaso
Nov 14 2017 21:48
add the -XX:+HeapDumpOnOutOfMemoryError option, the dump should provide useful info
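a sketch of combining that flag with a heap cap (dump path illustrative):

NXF_OPTS='-Xmx4g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/nf.hprof' nextflow run main.nf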
Mike Smoot
@mes5k
Nov 14 2017 21:49
will do!
Shawn Rynearson
@srynobio
Nov 14 2017 22:10
If at the end of processing you use the publishDir option, as opposed to storeDir as suggested, would you have to reprocess if you needed to -resume?
Paolo Di Tommaso
@pditommaso
Nov 14 2017 22:13
no
I mean the tasks are not recomputed
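a minimal sketch of why (directory and tool names hypothetical): publishDir only copies or links the declared outputs out of the task work directory, so on -resume the cached task is still found under work/ and reused:

process align {
    publishDir 'results/alignments'  // a copy/link of the outputs; the cache itself stays in work/

    input:
    file reads from reads_ch

    output:
    file 'out.bam' into bam_ch

    """
    my_aligner --in ${reads} --out out.bam  # my_aligner is a placeholder
    """
}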
Shawn Rynearson
@srynobio
Nov 14 2017 22:16
right, because -resume only re-processes what you've changed in the code; the published files are not a part of the cache.
Paolo Di Tommaso
@pditommaso
Nov 14 2017 22:16
yep
Shawn Rynearson
@srynobio
Nov 14 2017 22:16
thanks @pditommaso
Paolo Di Tommaso
@pditommaso
Nov 14 2017 22:16
:+1: