These are chat archives for nextflow-io/nextflow

27th
Mar 2018
Brad Langhorst
@bwlang
Mar 27 2018 13:00
I have a nextflow script that is aligning per-tile fastqs from a novaseq (this run is 29568 fastqs). It’s working nicely (thanks for the helpful tool!) but I’d like to make it more efficient by consuming multiple tiles per job. I see in the docs that there is a .buffer(size:NFILES) method on a channel. Is that the best way to deal with this situation? Is there a good way to consume N bytes of input instead of a fixed number of files (so that a run with lots of libraries has more tiles per alignment job than one with just a few libraries)? Maybe I sniff the fastq file size upstream and set NFILES upstream?
Vladimir Kiselev
@wikiselev
Mar 27 2018 13:34
@pditommaso just to let you know, I am now asked to organise a Nextflow workshop at the Sanger for our programme groups, people are very keen! Great job on making it user-friendly! :clap:
Maxime Garcia
@MaxUlysse
Mar 27 2018 13:34
@wikiselev You will need stickers!!
Vladimir Kiselev
@wikiselev
Mar 27 2018 13:36
yes, it's the only missing bit in the NF portfolio :) or do they exist?
Maxime Garcia
@MaxUlysse
Mar 27 2018 13:39
They exist, you just need to meet Paolo
Vladimir Kiselev
@wikiselev
Mar 27 2018 13:41
ok, cool, he is coming here soon, I will order some from him ;-)
Maxime Garcia
@MaxUlysse
Mar 27 2018 13:43
You can ask @ewels for stickers too, he has made some for nf-core
Vladimir Kiselev
@wikiselev
Mar 27 2018 13:50
:+1:
Paolo Di Tommaso
@pditommaso
Mar 27 2018 14:19
wow nice, sure, stickers exist
I will bring some to Sanger next month :)
@bwlang do you want to aggregate multiple fastqs together?
Brad Langhorst
@bwlang
Mar 27 2018 14:28
@pditommaso i’m passing them on a pipe to the aligner via seqtk mergepe so could cat multiple.
Paolo Di Tommaso
@pditommaso
Mar 27 2018 14:28
I see
Brad Langhorst
@bwlang
Mar 27 2018 14:29
e.g. seqtk mergepe <(zcat *.1.fastq.gz) <(zcat *.2.fastq.gz)
i’ll be back in an hour or so...
Paolo Di Tommaso
@pditommaso
Mar 27 2018 14:30
ok, I should be here
Brad Langhorst
@bwlang
Mar 27 2018 17:35
turned out to be a few hours… but if anyone has thought of any suggestions for how to approach processing variable size batches of files depending on file size - that would be appreciated.
Mike Smoot
@mes5k
Mar 27 2018 17:44
@bwlang I wonder if you could use buffer with a closing condition closure that emits true when some size threshold is reached? I think you'd need to define a cumulative size variable to keep track of things, but I'm guessing it would work.
Brad Langhorst
@bwlang
Mar 27 2018 17:49
hi @mes5k thanks, i’m not sure how to write such a closure - but I’ll investigate. I think it sounds a bit neater than precalculating a number of files and setting a variable to count the batch size.
could better handle non-even pools too.
Mike Smoot
@mes5k
Mar 27 2018 17:57

I just tried it:

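// Running byte count of the batch being filled.
// No 'def' here: the variable has to live in the script binding to be visible inside check().
total = 0

// Closing condition for buffer: true once adding f pushes the batch past 1000 bytes.
def check(f) {
    if ((f.size() + total) > 1000) {
        // The file that trips the threshold is emitted with the batch it closes, so restart from zero.
        total = 0
        return true
    } else {
        total += f.size()
        return false
    }
}

Channel.fromPath('*.nf').buffer{ check(it) }.view()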

Eyeballing the results suggests it works.

I've unfortunately been doing this sort of optimization a lot lately and there are other approaches. Depending on the data and how it comes to you, you might want a process that creates a bunch of directories containing the right number of correctly sized files, or one that returns a manifest file listing collections of files. In general I've found it cleaner to write my own processes to organize things than to string together nextflow operators to do the grouping.
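
As an illustration of the manifest idea, here is a minimal Python sketch of a script such a process might run. It is only a sketch under stated assumptions: the names `batch_by_size` and `write_manifest`, the tab-separated manifest layout, and the command-line arguments are all hypothetical. Unlike the closure above, it closes a batch before the file that would exceed the cap, so a batch only goes over the limit when a single file is larger than the cap. It also treats every file independently, so paired fastqs would need to be kept together by other means.

```python
import os
import sys
from typing import Iterable, List, Tuple


def batch_by_size(files: Iterable[Tuple[str, int]], max_bytes: int) -> List[List[str]]:
    """Greedily pack (path, size) pairs, in order, into batches of at most max_bytes.

    A single file larger than max_bytes ends up in a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for path, size in files:
        # Close the current batch if this file would push it over the cap.
        if current and current_bytes + size > max_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        # Keep the final, possibly smaller, batch instead of dropping it.
        batches.append(current)
    return batches


def write_manifest(batches: List[List[str]], out) -> None:
    """Write one 'batch_index<TAB>path' line per file (hypothetical layout)."""
    for index, batch in enumerate(batches):
        for path in batch:
            out.write(f"{index}\t{path}\n")


# Hypothetical usage: python make_manifest.py MAX_BYTES file1 file2 ...
if __name__ == "__main__" and len(sys.argv) > 2:
    cap = int(sys.argv[1])
    sized = [(p, os.path.getsize(p)) for p in sorted(sys.argv[2:])]
    write_manifest(batch_by_size(sized, cap), sys.stdout)
```

A downstream step could then read the manifest and group paths by the first column, so the batch size follows the data volume rather than a fixed file count.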

Brad Langhorst
@bwlang
Mar 27 2018 21:58
@mes5k thanks! - i learned a bit today!