Hi folks. I'm surprised to see that step 4 is taking much longer than step 3, when I expected the opposite. I've run through a few datasets fully, and everything is working smoothly (i.e., I'm not stalling at step 4), except that this step is oddly slow. Do you have any guesses as to why that might be? Perhaps the way that memory is allocated for that step?
@twpierson What does this dataset look like? How many samples? What datatype? You are right that normally step 4 should be much faster than step 3. I'm not sure what could be causing this. How long does step 3 take, and how long does step 4 take? How many cores and how much RAM are you running with?
This has held true for a few datasets of varying size, ranging from 10 to 120 samples (mostly in the ballpark of 500k to 2 million reads/sample), run as pairddrad. Right now, I'm doing this locally on a MacBook with 4 cores and 16GB RAM. I can't remember exactly how long each step took for the datasets that I've finished. Is this stored in a log file somewhere?
We don't store runtimes. Might be a nice feature though.
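In the meantime, you can capture runtimes yourself by wrapping each step invocation in a small timer. A sketch (the `ipyrad -p params.txt -s 4` call in the comment is illustrative; substitute your actual command and params file):

```shell
#!/usr/bin/env bash
# Sketch: run a command, measure wall-clock seconds, and append to a log.
run_timed() {
  local label=$1; shift
  local start end
  start=$(date +%s)
  "$@"
  end=$(date +%s)
  echo "$label: $((end - start)) s" | tee -a runtimes.log
}

# Example usage (illustrative command):
# run_timed "step4" ipyrad -p params.txt -s 4
```

Running each step through `run_timed` leaves a `runtimes.log` you can compare across datasets.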
The only thing I can really think of is that 16GB is lower than I normally recommend in terms of RAM. It could be (wild speculation) that step 4 is slowing down because it's maxing out the RAM and swapping to disk a whole bunch. I have seen this behavior on limited-RAM systems, but only at step 6, which is MUCH more memory intensive.
Still, I would think this would impact step 3 in just the same way, if not worse. You might try running step 4 again and monitoring the output of `top`. You should see 4 python processes running. You can see how much CPU they're using and get a sense of whether they're maxing out memory... I'll have to think about what else could be happening.
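If you'd rather grab a one-shot snapshot than watch `top` interactively, something like this works on macOS (and most Linux `ps` builds):

```shell
# Snapshot CPU and resident memory for python processes.
# RSS is reported in kilobytes on macOS; the awk filter keeps the
# header row plus any line whose command mentions python.
ps -axo pid,%cpu,rss,command | awk 'NR==1 || /[p]ython/'
```

Re-running it a few times during step 4 shows whether the workers' RSS is climbing toward your 16GB. To check whether the machine is actually swapping, `sysctl vm.swapusage` on macOS reports the swap currently in use.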