These are chat archives for nextflow-io/nextflow

21st
Jun 2016
Mike Smoot
@mes5k
Jun 21 2016 20:15
@pditommaso Hi Paolo, I'm wondering if there are any limitations on the number of jobs that nextflow can handle? Can it handle a channel that generates 10^6 values that will trigger 10^6 processes to run?
Evan Floden
@evanfloden
Jun 21 2016 20:22
@mes5k Paolo might be a bit hard to contact this week as he is away. I have done some runs with over 10,000 tasks per process and had no problems at all so my guess would be you should be okay.
Mike Smoot
@mes5k
Jun 21 2016 20:29
Thanks Evan, glad to hear that it should scale. Maybe I'll run a few tests and see what happens.
Paolo Di Tommaso
@pditommaso
Jun 21 2016 20:48
I don't think that magnitude of tasks is a problem.
In the worst case scenario you can increase the memory of the JVM
However, it may also depend on the underlying execution platform and the number of nodes (the more the better)
Mike Smoot
@mes5k
Jun 21 2016 20:56
Thanks Paolo. I'm assuming that queueSize will be much smaller than the number of eventual processes and I want to be sure nextflow can hang out with 10^6 jobs waiting to be executed. It sounds like if JVM memory is the limiting factor, then we should be fine.
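For context, the queueSize setting mentioned here caps how many tasks Nextflow hands to the executor at once; it lives in nextflow.config. The values below are purely illustrative, not a recommendation:

```groovy
// nextflow.config — illustrative values only
executor {
    name = 'ignite'   // or 'sge', 'slurm', 'local', ...
    queueSize = 100   // at most this many tasks are queued on the
                      // executor; the rest wait inside Nextflow
}
```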
Paolo Di Tommaso
@pditommaso
Jun 21 2016 21:04
Ok but you won't launch a million jobs all at once, the streaming nature of nextflow will submit them one after another. Thus it also depends on whether the underlying executor can keep pace.
Mike Smoot
@mes5k
Jun 21 2016 21:20
Actually, I could see a channel being populated fairly quickly with a million values (fasta file names in my case) and then needing to run blast on each. Knowing that blast will be slow and limited by the executor (most likely Ignite) I just want to make sure the main nextflow process won't crash with such a large channel sitting in memory. The reason I asked is that our group has run into queue size limitations on older clusters and has jumped through all kinds of hoops to deal with this. I'd like to avoid as much of that as possible and write clean, simple pipelines. Just trying to avoid lots of horrible old "optimizations" in our code base. :)
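A minimal sketch of the kind of pipeline being described, in the DSL of that era (file patterns, database path, and blast flags are hypothetical, and blastn stands in for whichever blast flavour is actually used):

```groovy
// Hypothetical sketch: one blast task per fasta file.
// The channel may hold ~10^6 paths, but Nextflow submits tasks
// one after another, bounded by the executor's queueSize.
fastas = Channel.fromPath('data/*.fasta')

process blast {
    input:
    file query from fastas

    output:
    file 'result.tsv' into results

    """
    blastn -query $query -db nt -outfmt 6 > result.tsv
    """
}
```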
Paolo Di Tommaso
@pditommaso
Jun 21 2016 21:25
Honestly I haven't heard of such a problem, but there's no 100% guarantee that it will work ;)
Anyway, if you could report the result of your run, it would surely be useful for the community
Mike Smoot
@mes5k
Jun 21 2016 21:29
Sounds good. It'll be a while until we run on a large dataset, but once we do I'll report back how it goes!
Paolo Di Tommaso
@pditommaso
Jun 21 2016 21:43
Good, also because if they are just file names they should not take up much memory. Thus handling a million of them should not be a big problem.