These are chat archives for dereneaton/ipyrad

11th
Mar 2017
draheem
@draheem
Mar 11 2017 17:11 UTC
@dereneaton Many thanks. (1) Last year I ran a series of analyses on the same RAD data using pyrad, setting max_low_quality_bases to 8, 6 and 4 for the 85%, 90% and 95% clustering thresholds respectively. Maybe follow an approach like this? (2) For filter_adapters in my ipyrad trial I used setting 2 (strict). I had a quick look in FastQC at some of the data before and after ipyrad steps 1-2. Samples with some adapter content before step 1 had none after the strict filtering, so I thought it might be good to stick with the strict filter. I used the default of 35 bp for the minimum trimmed read length (filter_min_trim_len) because I thought I should try to save as much potentially useful data as possible.
(3) I did go through the ipyrad documentation, but I am uncertain what all the column headings mean in the stats file from step 3. I have been assuming that avg_depth_total is the average depth across all clusters at the end of step 3, and that avg_depth_stat is the average depth for clusters meeting the specified minimum depth criterion (= 7 in my trial run). For my trial (85% clustering threshold, minimum depth of 7), avg_depth_total ranges from about 10-35 (most values are in the 15-20 range), avg_depth_stat (= avg_depth_mj) ranges from about 25-50, and clusters_hidepth ranges from about 7000-42000 (mostly between 15000-25000). The number of clusters seems low (and varies a lot between samples). I think my data are low-depth: in my pyrad analyses at clustering thresholds of 85%, 90% and 95%, I varied the minimum depth from 2 to 13 for each clustering threshold (i.e. I used 2, 5, 7, 9, 11, 13). I found that for all three clustering thresholds the length of the final assembly (i.e. number of positions) at a minimum_samples_in_a_locus of 4 declined by 70% as the minimum depth moved from 2 to 13 (i.e. from 5 million bp to 1.5 million bp). So it looks like the optimal minimum depth for my dataset is in the 7-9 range. (4) Should max_Hs_consens be varied with respect to the clustering threshold, i.e. should the parameter be set at a higher value for a low clustering threshold (85%) and a lower value for a high clustering threshold (90%)?
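The reading of the step 3 depth columns assumed in the message above can be made concrete with a small sketch. This is only an illustration of that interpretation, not ipyrad's actual implementation: the function names, the dictionary keys `mean_depth_all` and `mean_depth_hidepth`, and the toy depths are all hypothetical. The second helper simply reproduces the 70% figure from the quoted assembly lengths.

```python
from statistics import mean


def summarize_depths(depths, mindepth=7):
    """Summarize per-cluster read depths for one sample.

    Hypothetical helper: 'mean_depth_all' averages over every cluster,
    'mean_depth_hidepth' only over clusters with depth >= mindepth.
    """
    hidepth = [d for d in depths if d >= mindepth]
    return {
        "clusters_total": len(depths),
        "clusters_hidepth": len(hidepth),
        "mean_depth_all": mean(depths) if depths else 0.0,
        "mean_depth_hidepth": mean(hidepth) if hidepth else 0.0,
    }


def percent_decline(before, after):
    """Percentage drop from 'before' to 'after'."""
    return 100.0 * (before - after) / before


# Toy depths (made up for illustration, not real data).
toy_depths = [1, 2, 3, 7, 9, 20, 35]
summary = summarize_depths(toy_depths, mindepth=7)

# Assembly length quoted in the message: 5 million bp down to 1.5 million bp.
decline = percent_decline(5_000_000, 1_500_000)

print(summary)
print(f"decline: {decline:.0f}%")
```

Under this reading, the filtered average is always at least as large as the overall average, since low-depth clusters are excluded before averaging.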
Isaac Overcast
@isaacovercast
Mar 11 2017 21:21 UTC
@mtcthome People ask this question a lot. I finally updated the FAQ in the docs to answer this, but here's Deren's answer from a while ago, which is still true: "We're hoping to provide something similar eventually, the problem with the pyrad alleles file is that the alleles are only phased correctly when we enforce that reads must align almost completely, i.e., they are not staggered in their overlap. So the alleles are correct for RAD data, because the reads match up perfectly on their left side, however, staggered overlaps are common in other data sets that use very common cutters, like ezRAD and some GBS, and especially so when R1 and R2 reads merge. So we needed to change to an alternative way of coding the alleles so that we can store both phased and unphased alleles, and it's just taking a while to do. So for now we are only providing unphased alleles, although we do save the estimated number of alleles for each locus. This information is kind of hidden under the hood at the moment though."
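To illustrate the difference between phased and unphased alleles mentioned in the quote, here is a minimal sketch that assumes heterozygous sites in an unphased sequence are written with standard IUPAC ambiguity codes. The function name and the toy haplotypes are hypothetical, and this is not how pyrad or ipyrad store alleles internally; it only shows that two different phasings collapse to the same unphased sequence.

```python
# IUPAC ambiguity codes for two-base heterozygous sites.
IUPAC = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def unphased_consensus(hap1, hap2):
    """Collapse two aligned haplotypes into one IUPAC-coded sequence."""
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must be the same length")
    out = []
    for a, b in zip(hap1, hap2):
        # Homozygous sites keep the base; heterozygous sites get a code.
        out.append(a if a == b else IUPAC[frozenset((a, b))])
    return "".join(out)


# Two different phasings of the same two heterozygous sites (toy data).
phase_a = ("ACGT", "GCGC")
phase_b = ("ACGC", "GCGT")

cons_a = unphased_consensus(*phase_a)
cons_b = unphased_consensus(*phase_b)

print(cons_a, cons_b)
```

Both pairs give the same unphased sequence, which is why phase has to be stored separately if it is to be recovered later.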