These are chat archives for dereneaton/ipyrad

18th Apr 2017
Edgardo M. Ortiz
@edgardomortiz
Apr 18 2017 07:11
Hi @isaacovercast, I was checking the gphocs format in the latest update. The first taxon name of each locus is now correct; however, the first letter of each name is lost:
Isaac Overcast
@isaacovercast
Apr 18 2017 07:12
@edgardomortiz max_alleles_consens is used in steps 4 and 7. In step 4 it's really only used to determine whether the data are haploid for accurate heterozygosity estimation. In step 7 it's actually filtering on the number of alleles in the way you'd expect. It's possible that the behavior of this parameter was different in pyrad v3, which would explain the behavior you're seeing. It's important to distinguish between max alleles per cluster w/in a sample and max alleles per locus among samples. ipyrad applies this filter only to alleles per locus among samples at step 7. Does that help?
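Roughly, the idea behind that step 7 filter looks something like this (a sketch of the logic only, not the actual ipyrad code; the function and variable names here are made up for illustration):

```python
# Rough sketch of the step 7 idea (not the actual ipyrad code). Assumes we
# already have, for each locus, the per-sample allele counts recorded when
# consensus calls were made in step 5.

def passes_maxalleles(locus_allele_counts, max_alleles_consens=2):
    """Return True if the locus passes the filter.

    locus_allele_counts: dict mapping sample name -> number of alleles
    observed in that sample's consensus cluster for this locus.
    """
    return all(n <= max_alleles_consens for n in locus_allele_counts.values())

# Example: one sample shows 3 alleles at this locus, so the whole locus is dropped.
locus = {"sampleA": 2, "sampleB": 3, "sampleC": 1}
keep = passes_maxalleles(locus, max_alleles_consens=2)   # False
```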
Edgardo M. Ortiz
@edgardomortiz
Apr 18 2017 07:12
[screenshot attached: 2017-04-18 02.10.42 am.png]
Isaac Overcast
@isaacovercast
Apr 18 2017 07:13
@edgardomortiz Dang, I thought I fixed that. I'll look again.
@mossmatters if you re-run step 3 with the -d flag it will attempt to cluster the samples that failed and will also enable debug output. After you do this email me the ipyrad_log.txt file. I'll pm you my email address...
Edgardo M. Ortiz
@edgardomortiz
Apr 18 2017 07:16
@isaacovercast just to clarify: the consensus sequences after step 5 include ALL the within-sample clusters (diploid if you set maxalleles=2, and n-ploid as well), and all of them get clustered in step 6?
Isaac Overcast
@isaacovercast
Apr 18 2017 07:21
@edgardomortiz let me double-check this real quick...
Isaac Overcast
@isaacovercast
Apr 18 2017 07:34
@edgardomortiz I fixed the gphocs bug. Pushing a new version of the conda package v.0.6.16
Edgardo M. Ortiz
@edgardomortiz
Apr 18 2017 07:38
Thanks, @isaacovercast! Since the maxalleles filter now happens only at step 7 in ipyrad, would there be a way to implement the old pyrad behavior (a maxalleles filter at step 5)? I am working with several species of unknown ploidy, so this new way of filtering would discard many useful loci just because a single species has a polyploid consensus sequence, instead of simply eliminating that particular sequence before clustering in step 6. What I have in mind is sketched below (hypothetical code, nothing that exists in ipyrad; names are made up).
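```python
# Sketch of the old pyrad-style filtering being asked about (hypothetical, not
# current ipyrad code): drop a sample's consensus sequence right after step 5
# if its cluster showed more alleles than max_alleles_consens, so it never
# enters across-sample clustering in step 6.

def consens_to_cluster(consens_records, max_alleles_consens=2):
    """consens_records: list of (sequence, nalleles) pairs for one sample."""
    return [seq for seq, nalleles in consens_records
            if nalleles <= max_alleles_consens]

# A sample with one diploid and one "triploid" consensus keeps only the first.
records = [("ACGT" * 10, 2), ("ACGG" * 10, 3)]
print(len(consens_to_cluster(records)))   # 1
```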
Isaac Overcast
@isaacovercast
Apr 18 2017 07:51
@edgardomortiz Mmm, I looked over the code to refresh my memory of step 5. Right now step 5 (as with the rest of the pipeline) is really diploid-centric. There is some baseline consideration for haploid data (in step 4), but I haven't tested whether the full pipeline produces reasonable results for haploid data. As for polyploids, we just don't handle them well at this point. In step 5 polyploid data will essentially get downsampled to 2 alleles.
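To make "downsampled to 2 alleles" concrete, here is a toy illustration (this is not the actual likelihood model; the real step 5 calls use the H and E estimates from step 4):

```python
# Toy illustration of what "downsampled to 2 alleles" means in practice.
# The real step 5 uses a diploid likelihood model; here we simply keep the
# two most frequent bases observed at a site.
from collections import Counter

def diploid_site_call(column):
    """Collapse the bases observed at one site into at most two alleles."""
    counts = Counter(b for b in column if b in "ACGT")
    return [base for base, _ in counts.most_common(2)]

# A triploid-looking site with three real alleles gets reduced to two.
site = ["A", "A", "C", "C", "G", "G", "G"]
print(diploid_site_call(site))   # ['G', 'A'] -- the third allele is lost
```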
Edgardo M. Ortiz
@edgardomortiz
Apr 18 2017 07:55
@isaacovercast and I was re-reading the docs for parameter 18: "At this setting any locus which has a sample with more than 2 alleles detected will be excluded/filtered out", so it seems these should not be getting clustered in step 6. Am I correct to assume that the filter at step 7 works by looking at a locus site by site, and that if a site has more than maxalleles alleles, the locus is excluded?
Isaac Overcast
@isaacovercast
Apr 18 2017 09:26
@edgardomortiz @dereneaton I haven't spent much time in this part of the code, so maybe I'm making a mess of explaining this. In step 5 base calls are made under the assumption of diploidy and the number of haplotypes per locus is recorded. In step 7 reads with > max_alleles_consens haplotypes are filtered out. Deren, can you chime in here and let me know if I got this right?
Deren Eaton
@dereneaton
Apr 18 2017 14:36
@edgardomortiz @isaacovercast In step 5 base calls are made with a diploid model using the parameters estimated in step 4. The only special case is when max_alleles_consens=1, in which case the step 4 heterozygosity estimate will be fixed to zero and the error rate will suck up all of the variation within sites, and then the step 5 base calls will be haploid calls. For all other values of max_alleles_consens, base calls are made using the diploid model with the H and E values estimated in step 4. After site base calls are made, ipyrad then counts the number of alleles in each cluster. Like Edgardo said, this value is now simply stored in step 5 for use later in step 7 to filter loci, under the assumption that if a locus has paralogs in one sample then it probably has them in other samples, but there just wasn't enough variation to detect them. We don't currently have a way to implement the filter in step 5 instead of step 7 (like it used to be done in pyrad), but we could make that an option.
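The control flow described here could be sketched roughly like this (toy code with a majority-rule stand-in for the diploid model, not the real ipyrad internals):

```python
# Toy sketch of the step 5 flow described above (not the real ipyrad code).
from collections import Counter

def consensus_and_nalleles(cluster_reads, max_alleles_consens, het, err):
    """Make per-site base calls for one within-sample cluster and record how
    many haplotypes it shows, for use later in the step 7 filter."""
    if max_alleles_consens == 1:
        # haploid special case in the real pipeline: H is fixed to zero so the
        # error rate absorbs within-site variation (no effect in this toy,
        # majority-rule version, which ignores het/err)
        het = 0.0
    # stand-in for the diploid likelihood model: majority base per site
    consens = "".join(
        Counter(col).most_common(1)[0][0] for col in zip(*cluster_reads)
    )
    # number of distinct haplotypes in the cluster, simply stored here and
    # only used later, in step 7, to filter loci
    nalleles = len(set(cluster_reads))
    return consens, nalleles

reads = ["ACGTA", "ACGTA", "ACCTA", "ACCTA"]
print(consensus_and_nalleles(reads, max_alleles_consens=2, het=0.01, err=0.001))
```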
Edgardo M. Ortiz
@edgardomortiz
Apr 18 2017 16:24
@dereneaton @isaacovercast Being able to filter out consensus sequences from clusters that have more alleles than max_alleles_consens at step 5 would also be a great option, especially when combining species that are not so closely related, where I can't assume that what is a paralog in one species will also be one in the rest. In fact, this is leaving me with just a small fraction of the loci, pretty much invariant in most cases. I know ddRAD is not the best for phylogeny among species, but this option would at least help recover more loci from this kind of data, I think. It would also help identify problematic samples that have too many "polyploid" consensus sequences.
Edgardo M. Ortiz
@edgardomortiz
Apr 18 2017 16:52
@dereneaton @isaacovercast I guess, if the filtering happens at step 7, the stored number of alleles could alternatively be used to remove just the sequences with too many alleles from the alignments that go into the .loci file, instead of filtering them out at step 5 (so as not to disrupt the pipeline too much?). Something along these lines is what I mean (hypothetical sketch only; min_samples is just an illustrative threshold, not an ipyrad parameter name):
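```python
# Hypothetical sketch of that alternative (not existing ipyrad code): instead
# of dropping the whole locus at step 7, drop only the samples whose stored
# allele count exceeds max_alleles_consens, and keep the locus if enough
# samples are left.

def prune_locus(alignment, nalleles, max_alleles_consens=2, min_samples=4):
    """alignment: dict sample -> aligned sequence for one locus.
    nalleles: dict sample -> allele count recorded at step 5."""
    kept = {s: seq for s, seq in alignment.items()
            if nalleles.get(s, 1) <= max_alleles_consens}
    # the pruned alignment would still need re-aligning before writing .loci
    return kept if len(kept) >= min_samples else None
```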
Deren Eaton
@dereneaton
Apr 18 2017 16:57
That's true, but I think if you really want to exclude those consensus reads then you perhaps want to exclude them before across-sample clustering in step 6, since they will have some (although probably small) effect on the clustering.
Edgardo M. Ortiz
@edgardomortiz
Apr 18 2017 17:05
Yes, my solution would imply at least realigning the loci from which a sequence was removed and making changes to the database(?)
Deren Eaton
@dereneaton
Apr 18 2017 17:48
yep