These are chat archives for dereneaton/ipyrad

15 Apr 2018
Rebecca Tarvin
@frogsicles_twitter
Apr 15 2018 19:40
Hi @isaacovercast and @dereneaton ! I have been working through the ipyrad pipeline for a little while now on a new dataset of 307 samples with an average 2M reads/sample. Step 6 just won't finish during the allotted 48 hours. I've cut the RE sites (first 5 from R1, first 4 from R2) as well as the last 25 bases from R2, which were lower quality. I got them all through step 3 by running samples in smaller groups. For step 6, I've tried running it on a 512GB cluster and a 1TB cluster with -t 8 and -t 32. None have finished. The process goes pretty quickly until the "cluster across" step hits about 50% (around 24hr), then it seems to slow substantially and it doesn't get past 75% before timing out. Any ideas of how to speed things up? Other pertinent information: I am doing pairddrad de novo, there is no published genome but the organism is famous for having lots of repeat elements.
Deren Eaton
@dereneaton
Apr 15 2018 19:45
@frogsicles_twitter I think you'll probably have best luck using -t 8 or similar, since the slowdown may be due to memory limitations, and fewer threads will minimize that. But it sounds like you are requesting a lot of RAM. What are you using for your mindepth settings in step 5? If you exclude low-depth clusters (e.g., <5 or 10) that can greatly reduce the number of singleton reads and make everything run quite a lot faster. For step 6 the number of reads is not so relevant as the number of consensus sequences per sample: is it ~30K or more like ~1M per sample?
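[Editor's note: the mindepth settings discussed here correspond to parameters 11 and 12 in the ipyrad params file. An illustrative excerpt, with example values rather than recommendations:]

```
5          ## [11] [mindepth_statistical]: Min depth for statistical base calling
5          ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
```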
Rebecca Tarvin
@frogsicles_twitter
Apr 15 2018 19:51
@dereneaton OK, I can try -t 8. I used 5 for parameters 11 and 12; should I use a higher number? The number of consensus reads/sample ranges up to 70,000, averaging about 33K.
Deren Eaton
@dereneaton
Apr 15 2018 19:58
@frogsicles_twitter Oh ok, no, 5 is reasonable, and 30-70K loci is a good amount to have. I was just checking that you weren't using mindepth=1, which for very large datasets can slow things down quite a lot. The clustering progress is estimated while it runs, so it is often non-linear. If it completed 50% in 24 hours then you cannot necessarily expect that it will finish in 48 hours, but it should not take dramatically longer than that. The difficulty is when you run into memory limitations, which will slow it down considerably, since it will start writing information to disk instead of just using RAM. If that happens then it can slow down a lot in the latter parts of the clustering. We're working on finding improvements for this step on super large datasets. For now, I think you'll just need to request a longer wall time and continue to use a large-memory node.
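[Editor's note: a minimal sketch of resubmitting step 6 with a longer wall time on a high-memory node, assuming a SLURM scheduler. The job directives, queue settings, and the params filename `params-myassembly.txt` are illustrative assumptions; the ipyrad flags `-p`, `-s`, `-c`, and `-t` are the standard CLI options.]

```bash
#!/bin/bash
# Illustrative SLURM resubmission sketch; adjust directives for your cluster.
#SBATCH --time=168:00:00        # request a week rather than 48 hours
#SBATCH --mem=512G              # stay on a large-memory node
#SBATCH --cpus-per-task=32

# Re-running step 6 resumes the existing assembly; -c sets cores, -t threads per job.
ipyrad -p params-myassembly.txt -s 6 -c 32 -t 8
```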
Rebecca Tarvin
@frogsicles_twitter
Apr 15 2018 20:04
@dereneaton OK I will try to ask for more time! I had such hope it would finish! My feeling is that it does slow down substantially, so it's possible that it starts writing to the disk. Thanks for your help!