These are chat archives for dereneaton/ipyrad

18 Apr 2018
Todd Pierson
@twpierson
Apr 18 2018 01:47
@isaacovercast @dereneaton : did y'all ever solve the problem occurring with pops files (referenced here: dereneaton/ipyrad#278 )? I'm having similar errors in similar circumstances.
Isaac Overcast
@isaacovercast
Apr 18 2018 13:27
@twpierson Yeah.... This is still broken, it's a tricky problem. The workaround is to rerun step 6 with your subset of samples. I've been meaning to at least put in a warning to detect this problem. The fix is non-trivial.
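For reference, the branch-and-rerun workaround looks roughly like this (a sketch only; the params filenames and sample names here are placeholders — substitute your own):

```shell
# Create a new branch ("subset") containing only the samples you want
ipyrad -p params-data.txt -b subset sample1 sample2 sample3

# Rerun steps 6 and 7 on the new branch
ipyrad -p params-subset.txt -s 67
```

Branching first keeps the original assembly intact, so the full-sample results are still there if you need them later.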
Todd Pierson
@twpierson
Apr 18 2018 13:40
@isaacovercast : 10-4. Thanks for the update.
Amanda Haponski
@ahaponski_twitter
Apr 18 2018 14:58

@isaacovercast It's denovo. So I looked at the stats files (values are below), but I'm not sure I really understand why there would be such differences (besides program versions) when the data and parameter files are the same. The average number of loci per sample is ~1400 after clustering for 0.7.17 and ~1800 for 0.6.17. Thanks again for all of your help!!!

0.7.17:
filter                       total_filters  applied_order  retained_loci
total_prefiltered_loci              108603              0         108603
filtered_by_rm_duplicates             3397           3397         105206
filtered_by_max_indels                 684            684         104522
filtered_by_max_snps                   447              0         104522
filtered_by_max_shared_het            1812           1453         103069
filtered_by_min_sample              104297         101257           1812
filtered_by_max_alleles               1148              9           1803
total_filtered_loci                   1803              0           1803

0.6.17:
filter                       total_filters  applied_order  retained_loci
total_prefiltered_loci              109049              0         109049
filtered_by_rm_duplicates             4023           4023         105026
filtered_by_max_indels                5985           5985          99041
filtered_by_max_snps                  2689              1          99040
filtered_by_max_shared_het            1757           1346          97694
filtered_by_min_sample              104352          95373           2321
filtered_by_max_alleles               1270             15           2306
total_filtered_loci                   2306              0           2306

Isaac Overcast
@isaacovercast
Apr 18 2018 17:49
@ahaponski_twitter Rather than thinking in terms of raw counts of loci, I prefer to think in terms of the fraction of total prefiltered loci. Viewed this way, the difference between the two versions is <0.5%, which is quite small. A subtle difference like this could easily arise from any of the bug fixes we applied to the 0.7 branch, so I wouldn't worry about it too much. One question I do have, though: what are you using for your min_samples_locus value? As you can see, that's where you're losing approximately 99% of your data. I would consider dialing it down quite a bit.
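To make that concrete, here's a quick back-of-the-envelope check (a standalone snippet, not part of ipyrad) using the total_filtered_loci and total_prefiltered_loci counts from the stats tables above:

```python
# Fraction of prefiltered loci retained in each version
# (numbers taken from the posted stats tables).
retained = {
    "0.7.17": 1803 / 108603,
    "0.6.17": 2306 / 109049,
}

for version, frac in retained.items():
    print(f"{version}: {frac:.2%} of prefiltered loci retained")

# The between-version difference is well under half a percent of the data.
diff = abs(retained["0.6.17"] - retained["0.7.17"])
print(f"difference: {diff:.2%}")  # → difference: 0.45%
```

So both versions retain roughly 1.7–2.1% of loci at these settings, which is why relaxing min_samples_locus has far more effect than the version change.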
cwessinger
@cwessinger
Apr 18 2018 18:20
Random question: does anyone happen to know the maximum number of species that BPP can handle? Thanks!
Amanda Haponski
@ahaponski_twitter
Apr 18 2018 19:05
@isaacovercast I saw that most values were pretty similar. Thank you for the explanation; that helps a lot!!! I typically run three different values for min samples: 75, 50, and 25%. The one I sent was the 75%.
Dan MacGuigan
@DMacGuig_twitter
Apr 18 2018 20:22
Hi @dereneaton and @isaacovercast. I'm currently using ipyrad v0.7.23 to assemble a rather large dataset containing 480 ddRAD samples. I know that databasing in step 6 is a bottleneck right now. For my dataset, the predicted databasing runtime using 8 cores and 40 GB of RAM is ~10 days. Do you have any suggestions to help speed things up? Would allocating more RAM or more cores improve runtimes? Also, is checkpointing implemented for the databasing step? Thanks for your help!