These are chat archives for dereneaton/ipyrad

10th
Mar 2017
mtcthome
@mtcthome
Mar 10 2017 12:29
@isaacovercast Hi Isaac! I'm switching from pyrad to ipyrad and I'm looking for a way to get something similar to the .alleles file, which is not implemented in ipyrad yet. I read in the pyrad Google group that it's tricky to get reliable phased alleles for some data types, but for single-end RAD data it should be OK. Is that right? If so, is there a script to convert from .loci to .alleles? Thanks!!!
draheem
@draheem
Mar 10 2017 15:31
I am working with a RADseq dataset (Illumina MiSeq R1 reads, maximum length 242 or 251 bp, beginning with the 6-base SbfI restriction overhang TGCAGG). The dataset comprises 30 samples in a single genus and I am using ipyrad-generated phylip files to reconstruct a series of phylogenies. I have just done a trial run using ipyrad v. 0.6.10 and it all ran smoothly and fast – many thanks. I am planning to use clustering thresholds of 0.85, 0.9 and possibly 0.95. I would like to know how to decide on the values for the following assembly parameters:
1) Param. 9. max_low_qual_bases
2) Param. 17. filter_min_trim_len
3) Param. 19. max_Ns_consens
and 4) Param. 20. max_Hs_consens
Deren Eaton
@dereneaton
Mar 10 2017 21:17
@draheem
(1) You will not want to allow too many Ns in your reads, especially at a 95% clustering threshold, because N differences can keep otherwise identical read copies from clustering together. Since your reads are quite long you could probably increase this from its default value.
(2) With 250bp reads I expect that you will have some reads trimmed to shorter lengths due to decreased quality at the 3' end (if you have filtering turned on; e.g., "filter_adapters = 1"). The default minimum length is 35. If you want to enforce a longer minimum length you can increase it here.
(3) Similarly, if you have low depth data you may have many sites with poor base calls (Ns) and these will affect across-sample clustering, so you can limit the max number of Ns here. NB: this only counts internal Ns in consensus reads, since terminal Ns will be trimmed off.
(4) A high number of heterozygous sites within a consensus read may be a sign that you clustered paralogs or repetitive regions together. Since your reads are quite long you may want to increase this from the default setting.
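To make (3) and (4) concrete, here is a minimal Python sketch of what those two filters amount to. It is not ipyrad's own code: the function names and the threshold values in the usage lines are hypothetical, and it assumes heterozygous sites are written as two-base IUPAC ambiguity codes (R, Y, S, W, K, M) in the consensus sequence.

```python
# Illustrative sketch only; helper names are hypothetical, not ipyrad's API.

# IUPAC codes that stand for exactly two bases (assumed to mark heterozygous sites).
HETERO_CODES = set("RYSWKM")


def count_internal_ns(consensus: str) -> int:
    """Count Ns in a consensus read, ignoring runs of Ns at either end."""
    trimmed = consensus.upper().strip("N")  # terminal Ns are trimmed off first
    return trimmed.count("N")


def count_hets(consensus: str) -> int:
    """Count sites written as two-base ambiguity codes."""
    return sum(base in HETERO_CODES for base in consensus.upper())


def passes_filters(consensus: str, max_ns: int, max_hs: int) -> bool:
    """Keep a consensus read only if it is within both limits."""
    return (count_internal_ns(consensus) <= max_ns
            and count_hets(consensus) <= max_hs)


# Usage with made-up sequences and limits:
read = "NNACGTNACRTTYGNN"
print(count_internal_ns(read))                    # internal Ns only
print(count_hets(read))                           # R and Y count as heterozygous
print(passes_filters(read, max_ns=1, max_hs=2))
```

With longer reads, the same per-site rates of Ns or heterozygous calls produce larger counts per read, which is why raising these limits can make sense for 250 bp data.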