These are chat archives for nextflow-io/nextflow

11th
Oct 2018
Oliver Schwengers
@oschwengers
Oct 11 2018 09:58
Hi Paolo, thanks a lot for developing NF and for the great support! I recently coded a pipeline (in short: multi fasta -> map -> flatmap -> process (prodigal+blast) -> map -> groupTuple -> map -> sort -> collectFile). It's running locally on a machine with 64 cores, no special configuration. The flatmap step emits ~210,000 items. For ~1h everything works perfectly, but then suddenly the local executor seems to run only 5 processes in parallel. This went on for about 18h and then, without any obvious reason, it continued working with the expected 64 processes in parallel. I had a look into the JVM via VisualVM and could figure out that during the slow processing stage only 5 reaper threads were active. Before and after there are a lot more. Is this on purpose, or could it be that there is some (to me) strange thread behavior which might lead to this performance drop? The logs are completely fine, nothing odd there... Unfortunately, I currently don't have the time to set up a reproducible workflow. But I could reproduce it on a second machine with the same specs.
Paolo Di Tommaso
@pditommaso
Oct 11 2018 10:04
you may need to better tune the JVM memory and garbage collector, VisualVM should provide information about the GC activity
Oliver Schwengers
@oschwengers
Oct 11 2018 10:16
I checked this in the beginning. At first I had serious issues with GC activity (GC overhead exceptions); since then I provide min 128 GB / max 300 GB memory via NXF_OPTS="-Xms128G -Xmx300G". This solved the GC issue. I checked it via VisualVM: this is not due to a low memory/GC issue. The GC runs once in a while, does its job and goes back to sleep as expected. While the GC is active, the CPU load spikes to 3000 % for a couple of seconds and then everything is fine
Paolo Di Tommaso
@pditommaso
Oct 11 2018 10:20
min 128 GB / max 300 GB memory
is a lot
are you using any splitXxx operator?
Oliver Schwengers
@oschwengers
Oct 11 2018 10:22
yes, splitFasta. It does a lot of in-memory processing, so the huge memory consumption is somewhat ok and, at least to me, not a problem. 100 GB would be enough, but as a test I wanted to keep the GC from taking too much CPU
Paolo Di Tommaso
@pditommaso
Oct 11 2018 10:22
have you checked the yellow warning box here ?
Oliver Schwengers
@oschwengers
Oct 11 2018 10:24
yep, I split them in mem and use the record objects. The high-mem task are native analysis-specific code afterwards, though
from each fasta sequence I generate random subsequences which I pass over via flatMap to the subsequent process (prodigal+blast)
Paolo Di Tommaso
@pditommaso
Oct 11 2018 10:30
I think that it's wasting CPU time somewhere, try to investigate that by profiling the JVM activity
Oliver Schwengers
@oschwengers
Oct 11 2018 10:37
The actual JVM process runs at exactly 100% load. So there is no internal multithreaded execution, whether GC or actual code. All I could figure out from the JVM inspection is that A) during the "low performance stage" significantly fewer "reaper threads" exist and B) a lot of CPU time is being spent in the cpu() and mem() methods. I do not know exactly what the reaper threads are doing, but to me it seems that NF cannot clean up the completed tasks and thereby free the local queue for new tasks. But that's just a guess. Maybe you have an idea or comment about this?
Paolo Di Tommaso
@pditommaso
Oct 11 2018 11:22
does VisualVM allow the export of this info so I can take a look?
Oliver Schwengers
@oschwengers
Oct 11 2018 12:25
There are some options... I'll check it
Oliver Schwengers
@oschwengers
Oct 11 2018 13:00
ok, now I have 3 application snapshots from VisualVM. They show that there is a changing number of "process reaper" threads. When their number is high, NF works with the expected parallelism of 64 local process instances, but after a while the number of process reaper threads suddenly goes down, just like the parallelism of local process instances.
Paolo Di Tommaso
@pditommaso
Oct 11 2018 13:03
are you only using local executor ?
Oliver Schwengers
@oschwengers
Oct 11 2018 13:03
Additionally, during normal execution the actual JVM load is significantly below 100 % (50-70 %), but in low performance/parallelism stages the load is constantly 100 %. So maybe there are some internal bottlenecks keeping NF from firing up new local instances.
yep, the whole pipeline is strictly running locally. local data, local executor. no NFS storage, no cluster
Paolo Di Tommaso
@pditommaso
Oct 11 2018 13:04
can you copy and paste these thread dumps here ?
also what version are you using ?
Oliver Schwengers
@oschwengers
Oct 11 2018 13:04
it's huge, 1,800 lines
version 32
Paolo Di Tommaso
@pditommaso
Oct 11 2018 13:04
pastebin ?
Oliver Schwengers
@oschwengers
Oct 11 2018 13:05
$ nextflow -v
nextflow version 0.32.0.4897
Paolo Di Tommaso
@pditommaso
Oct 11 2018 13:05
:+1:
Oliver Schwengers
@oschwengers
Oct 11 2018 13:06
Maxime Vallée
@valleem
Oct 11 2018 13:24
Hello NF community! I have a channel holding files named cohort.chr1.vcf, cohort.chr2.vcf, cohort.chr3.vcf, ...,cohort.chr10.vcf, cohort.chr11.vcf,...,cohort.chr22.vcf,cohort.chrX.vcf, cohort.chrY.vcf. I would like to sort (like a Unix version sort) my channel on the filename. I have managed to sort it but I only get chr1, chr10, chr11... (you know the deal). Any tip to deal with it elegantly?
Paolo Di Tommaso
@pditommaso
Oct 11 2018 13:53
@oschwengers can you also upload the nf log file?
@valleem something like .sort { it.name.tokenize('.')[1].substring(3).toInteger() }
Oliver Schwengers
@oschwengers
Oct 11 2018 14:06
@pditommaso, it's 60 MB.
I could send you a gzip via Google mail?
it's 6.3 MB
Luca Cozzuto
@lucacozzuto
Oct 11 2018 14:26
Hi, is there a way to use the map operator on a variable number of entries? i.e.
channel.map{
    sample_id, internal_code, analysis_type, checksum, timestamp, filename, * -> sample_id, internal_code, analysis_type, checksum, timestamp, filename, [*]
}
Maxime Vallée
@valleem
Oct 11 2018 14:29

@pditommaso I am close to it with tuples, attaching the chromosome name to the file :

vcf_hf_ch
    .map { chr, file -> tuple(chr.replace("chr",""), file) }
    .toSortedList( { a, b -> a[0] <=> b[0] } )
    .println()

However, it is doing the 1,10,11,...,19,2,20,21,22,3,4,5,...,9,X,Y sort. I tried .map { chr, file -> tuple(chr.replace("chr","").toInteger(), file) } but it did not like those X and Y chromosomes.

This person has the answer, but I am not sure how to articulate it in NF
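One way to get the version-style order asked about above, as a minimal sketch in plain Groovy. The names chrKey and compareChr are hypothetical helpers, not part of Nextflow; the sketch assumes labels look like chr1 ... chr22, chrX, chrY, and puts non-numeric labels after the numeric ones.

```groovy
// Hypothetical helper: turn a chromosome label into a sortable key.
// Numeric labels get [0, number, '']; anything else (X, Y, ...) gets [1, 0, label],
// so numbers sort by value and letters sort after them, alphabetically.
def chrKey = { String chr ->
    def name = chr.replace('chr', '')
    name.isInteger() ? [0, name.toInteger(), ''] : [1, 0, name]
}

// Hypothetical comparator built on chrKey; compares the key fields left to right.
def compareChr = { String a, String b ->
    def ka = chrKey(a)
    def kb = chrKey(b)
    (ka[0] <=> kb[0]) ?: (ka[1] <=> kb[1]) ?: (ka[2] <=> kb[2])
}

def chroms = ['chr10', 'chrX', 'chr2', 'chr1', 'chrY', 'chr22', 'chr11']
def sorted = chroms.sort(false, compareChr)   // false: leave the input list untouched
println sorted
```

With the (chr, file) tuples from the snippet above, the same comparator could presumably be plugged in as .toSortedList { a, b -> compareChr(a[0], b[0]) }, leaving the chr prefix in place instead of stripping it first.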
Paolo Di Tommaso
@pditommaso
Oct 11 2018 14:47
@oschwengers a dropbox/gdrive link here?
it's a snapshot of the currently running workflow
please shortly wave hands when you downloaded it, so I can delete it...
Luca Cozzuto
@lucacozzuto
Oct 11 2018 15:24
no one can help me? :)
Paolo Di Tommaso
@pditommaso
Oct 11 2018 15:35
@oschwengers downloaded
@lucacozzuto depends on how much you pay :)
replace sample_id, internal_code, analysis_type, checksum, timestamp, filename, * with it
Luca Cozzuto
@lucacozzuto
Oct 11 2018 15:42
A = Channel.from([1,"A1"], [2,"B1"]);
B = Channel.from([1,"A2"], [2,"B2"]);

J=A
    .join( B )

J.map{ it -> a,[*] }
     .println()
Paolo Di Tommaso
@pditommaso
Oct 11 2018 15:42
WHAT THE HELL IS THIS [*] ?
Luca Cozzuto
@lucacozzuto
Oct 11 2018 15:43
I like it
Paolo Di Tommaso
@pditommaso
Oct 11 2018 15:43
programming languages have their own syntax, not the one you pretend they have!
Mike Smoot
@mes5k
Oct 11 2018 15:43
it's almost pythonic
Luca Cozzuto
@lucacozzuto
Oct 11 2018 15:44
@pditommaso it's not real code
Paolo Di Tommaso
@pditommaso
Oct 11 2018 15:44
no Python or GO in this channel !! :joy:
Luca Cozzuto
@lucacozzuto
Oct 11 2018 15:44
I'm telling you I would like to collect the rest of the things
so this is what I have
[1, A1, A2, A..n]
[2, B1, B2, B..n]
and this is what I would like to have
Paolo Di Tommaso
@pditommaso
Oct 11 2018 15:45
it is a list, therefore you can manipulate it as needed
Luca Cozzuto
@lucacozzuto
Oct 11 2018 15:45
[1, [A1, A2, A..n]]
[2, [B1, B2, B..n]]
considering that A1, A2 can be of variable numbers
Paolo Di Tommaso
@pditommaso
Oct 11 2018 15:46
J.map{ it -> def l = ['foo']; l.addAll(it); return l }
Luca Cozzuto
@lucacozzuto
Oct 11 2018 15:46
and please do not complain about A..n :)
[foo, 1, A1, A2]
[foo, 2, B1, B2]
?
Paolo Di Tommaso
@pditommaso
Oct 11 2018 15:51
google for groovy / java list guide for dummies
Luca Cozzuto
@lucacozzuto
Oct 11 2018 16:09
J.map{ it -> def l = [it[0]]; l.addAll([it.drop(1)]); return l }.println()
[1, [A1, A2]]
[2, [B1, B2]]
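The same key-plus-rest reshaping can also be written with Groovy's head() and tail() list methods. This is a minimal sketch on plain lists, where joined is a made-up stand-in for the items emitted by the joined channel J; the rows may have any number of trailing values.

```groovy
// Stand-in data for the joined channel items (hypothetical values).
def joined = [[1, 'A1', 'A2'], [2, 'B1', 'B2', 'B3']]

// head() is the first element (the key), tail() is everything after it.
def nested = joined.collect { row -> [row.head(), row.tail()] }

nested.each { println it }
```

Applied to the channel, the closure body stays the same: J.map { row -> [row.head(), row.tail()] }.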
Tobias "Tobi" Schraink
@tobsecret
Oct 11 2018 22:43
if I have a file object that can contain one or more files, how can I get the number of files in it?
myfile = file(bamfiles) //can be one or more files
x = number of files in bamfiles?
The reason I need this is because I want to dynamically scale the memory allotted to a process depending on how many inputs it has
Tobias "Tobi" Schraink
@tobsecret
Oct 11 2018 22:58
Dug a little bit and it seems I can use bamfiles.size(),
but this throws an error when I have only one file in bamfiles
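A possible workaround, as a minimal sketch in plain Groovy. It assumes (not confirmed in this thread) that the value is a single Path when there is one file and a List when there are several; countInputs is a hypothetical helper name.

```groovy
import java.nio.file.Paths

// Hypothetical helper: a collection counts its elements, anything else counts as one input.
def countInputs = { input -> input instanceof Collection ? input.size() : 1 }

def single = Paths.get('a.bam')
def several = [Paths.get('a.bam'), Paths.get('b.bam'), Paths.get('c.bam')]

println countInputs(single)
println countInputs(several)
```

The count could then feed a dynamic directive, for example memory { 2.GB * countInputs(bamfiles) }, where the 2 GB per file is only an illustrative figure.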
Rad Suchecki
@rsuchecki
Oct 11 2018 23:41
This message was deleted
Rad Suchecki
@rsuchecki
Oct 11 2018 23:53
Where does object come from? @tobsecret