@twpierson Step 4 performs a maximum-likelihood optimization to jointly estimate the two model parameters for each sample. If you have hundreds of samples, doing this separately for each sample is somewhat overkill: the estimates will likely be very similar, and step 5 is not actually sensitive to very small differences in them, but that is what ipyrad currently does. One way to speed things up would therefore be to run step 4 on just a subset of samples and use the mean values as the parameter estimates in step 5; I would expect the results to be nearly identical.

Alternatively, we could aim to speed up each individual fit, since it is kind of nice to get the value estimates for each sample. The optimization is done with scipy, and I can think of three possible ways to speed it up, which we will look into:

1. Lower the convergence threshold in the scipy ML optimization.
2. Don't use all site patterns, since a random subset of site patterns should be sufficient to estimate these params.
3. Speed up the function that scipy is optimizing (I've spent a fair bit of time on this already, though).

My guess for why your CPUs are running at less than 100% is that you are running ipyrad with 4 processes on a 4-core computer, so other background processes on your computer are using a bit of CPU power that takes away from the fourth process. If you ran ipyrad with `-c 3`, it should run the three processes at 100%. But I would guess it will probably still finish faster with `-c 4`, even if not all cores are at 100%.
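To make the speedup ideas above concrete, here is a minimal sketch of jointly estimating two parameters (a heterozygosity rate and an error rate) by maximum likelihood with scipy. This is *not* ipyrad's actual model or code; the simple binomial mixture, the variable names, and the simulated data are all assumptions for illustration only:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import binom

def neg_loglik(params, majors, depths):
    """Negative log-likelihood of per-site major-base counts under a toy
    two-parameter mixture: each site is heterozygous with probability het,
    and sequencing errors occur independently at rate err."""
    het, err = params
    if not (0.0 < het < 1.0 and 0.0 < err < 0.5):
        return np.inf  # keep the optimizer inside valid parameter space
    # homozygous site: the major base is observed with probability 1 - err
    p_homo = binom.pmf(majors, depths, 1.0 - err)
    # heterozygous site: each of the two alleles is observed ~half the time
    p_hete = binom.pmf(majors, depths, 0.5)
    lik = (1.0 - het) * p_homo + het * p_hete
    return -np.sum(np.log(lik + 1e-300))

# simulate toy data: (major-base count, depth) per site, true het=0.01, err=0.001
rng = np.random.default_rng(0)
depths = rng.integers(10, 40, size=2000)
is_het = rng.random(2000) < 0.01
majors = rng.binomial(depths, np.where(is_het, 0.5, 1.0 - 0.001))

res = minimize(neg_loglik, x0=[0.01, 0.001], args=(majors, depths),
               method="Nelder-Mead")
het_est, err_est = res.x
```

Points (1) and (2) map directly onto a sketch like this: passing e.g. `options={"xatol": 1e-4, "fatol": 1e-4}` to `minimize` loosens the convergence threshold, and fitting on a random subset of sites (e.g. `rng.choice(len(depths), 500, replace=False)`) shrinks the cost of every likelihood evaluation.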