These are chat archives for dereneaton/ipyrad

7th
Dec 2015
Isaac Overcast
@isaacovercast
Dec 07 2015 01:34
Not exactly. bedtools merge finds all overlapping sequences within genomic regions. Then samtools pileup outputs alignments of all reads within each specific region; it doesn't call consensus sequences (it can, but it sucks at it). We could use bcftools (post pileup) to call SNPs within each stack, but I was thinking this may not be ideal, because then we'd be using two different methods to call SNPs (one for mapped reads, one for unmapped) and that may introduce some weird bias. This way the output of pileup is more like the output of muscle: alignments across all reads in a region.
I'll update the pyrad refseq mapping docs to make the whole process make more sense. Gimme a few minutes....
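For readers following along, the region-merging step described above can be sketched in plain Python. This is only an illustration of the idea behind `bedtools merge` (collapsing overlapping or book-ended intervals into one region per stack), not the code ipyrad or bedtools actually runs. The function name `merge_intervals` and the tuple layout `(chrom, start, end)` are assumptions made up for this example.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Collapse overlapping or touching intervals into merged regions.

    `intervals` is an iterable of (chrom, start, end) tuples using
    BED-style half-open coordinates. Hypothetical helper for illustration.
    """
    # Group intervals by chromosome/scaffold so merging never crosses them.
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        cur_start = cur_end = None
        # Sorting by start lets us merge in a single left-to-right pass.
        for start, end in sorted(by_chrom[chrom]):
            if cur_start is None:
                cur_start, cur_end = start, end
            elif start <= cur_end:
                # Overlaps (or touches) the open region: extend it.
                cur_end = max(cur_end, end)
            else:
                # Gap found: close the open region and start a new one.
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        if cur_start is not None:
            merged.append((chrom, cur_start, cur_end))
    return merged


reads = [
    ("chr1", 100, 200),
    ("chr1", 150, 250),
    ("chr1", 400, 500),
    ("chr2", 10, 20),
]
regions = merge_intervals(reads)
```

Each merged region would then be the unit handed to pileup, so that all reads in that region come back as one alignment, similar to one muscle-aligned cluster.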
Deren Eaton
@dereneaton
Dec 07 2015 01:37
OK, that sounds great. I agree it's good that it would be consistent. I worked on making faster consensus calls today.
Isaac Overcast
@isaacovercast
Dec 07 2015 01:39
In terms of using external tools to do base calling for unmapped reads, I'm not sure that'd work. bcftools would be the best route (outside picard/gatk, which introduces a whole other nightmare of external dependencies). bcftools needs reference seq positions to do the magic; we could kludge it, but I'm not sure it's worthwhile. Better to bring all mapped/unmapped reads back into the pipeline at step4 and focus on making our shit work better, I think.
Isaac Overcast
@isaacovercast
Dec 07 2015 16:58
Slight digression... I can see cluster_within doing some gymnastics to account for the peculiarities of GBS data. Given that the results are messier than RAD-seq, people use GBS because it's cheaper, right? In your experience is there a difference in coverage between these methods? Just curious...
Deren Eaton
@dereneaton
Dec 07 2015 16:58
Yeah, hella difference
GBS data tend to have super variable coverage, with many more singletons
I'm actually cleaning up cluster_within right now, and fixing up the alignment.
I improved the edge trimming in step2 significantly, so I think a lot of the GBS messiness doesn't need to be checked again in step3 anymore.
I've been testing this with a really messy empirical paired GBS data set of mine
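As a rough way to put numbers on "super variable coverage, with many more singletons", here is a minimal sketch that summarizes a list of per-cluster read depths. It assumes you already have one integer depth per cluster from the within-sample clustering step. The function name `depth_summary` and the example depths are hypothetical and are not taken from ipyrad or from any real data set.

```python
from statistics import mean, pstdev


def depth_summary(cluster_depths):
    """Summarize per-cluster read depths (hypothetical helper).

    Returns the number of clusters, mean depth, coefficient of variation
    (stdev / mean, a unitless measure of how variable coverage is), and
    the fraction of clusters that are singletons (depth == 1).
    """
    depths = list(cluster_depths)
    if not depths:
        raise ValueError("need at least one cluster depth")
    mu = mean(depths)
    singletons = sum(1 for d in depths if d == 1)
    return {
        "n_clusters": len(depths),
        "mean_depth": mu,
        "cv": pstdev(depths) / mu,
        "singleton_fraction": singletons / len(depths),
    }


# Made-up depths, only to show the call; not real GBS or RAD-seq results.
example = depth_summary([1, 1, 1, 1, 20, 40])
```

Comparing `cv` and `singleton_fraction` between a GBS sample and a RAD-seq sample would be one simple way to check the difference in coverage described above.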