Can you think of a decent filter along those lines for a multi-sample vcf?
I'm still stuck on the problem of trying to determine reliably called vs uncalled for each sample at each locus. I could always fall back to a simple depth filter, but it seems like there must be something better... Though now that I look at it, DP, RO, and QR seem to be the only FMT values available to do genotype filtering on.
PS: There are two use-cases here. One is trying to generate per-sample sequences (fasta) without defaulting to the ref. The second (more interesting one) is computing absolute diversity and divergence metrics, which is where joint calling is also really useful.
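For reference, the simple depth-filter fallback mentioned above could look roughly like the sketch below. It is only an illustration: `mask_low_depth` and the `min_dp` threshold are made-up names, and it assumes each record carries GT and DP in the FORMAT column. Genotypes below the depth cutoff are set to missing, so a downstream fasta step does not default to the ref for them.

```python
import re


def mask_low_depth(vcf_line, min_dp=10):
    """Return the VCF data line with GT set to missing for every sample
    whose FORMAT DP is missing or below min_dp (hypothetical threshold)."""
    fields = vcf_line.rstrip("\n").split("\t")
    keys = fields[8].split(":")
    if "GT" not in keys or "DP" not in keys:
        return "\t".join(fields)  # nothing to filter on
    gt_i, dp_i = keys.index("GT"), keys.index("DP")
    for col in range(9, len(fields)):
        values = fields[col].split(":")
        dp = values[dp_i] if dp_i < len(values) else "."
        if dp == "." or int(dp) < min_dp:
            # Replace each allele with "." but keep the / or | separators,
            # so the ploidy of the original call is preserved.
            values[gt_i] = re.sub(r"[^/|]+", ".", values[gt_i])
            fields[col] = ":".join(values)
    return "\t".join(fields)


if __name__ == "__main__":
    line = "chr1\t100\t.\tA\tG\t50\t.\tDP=28\tGT:DP:RO:QR\t0/1:25:12:400\t0/0:3:3:90"
    print(mask_low_depth(line, min_dp=10))
```

A single DP cutoff is crude when coverage varies a lot between samples, so the threshold would probably need tuning per dataset.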
@ekg or anyone else. I'm calling SNPs and InDels using freebayes on amplicon (targeted sequencing) data. We know the region where polymorphic events should occur. freebayes calls the SNP event nicely, but completely misses the insertion event (we know this from Sanger seq and IGV). I have 200k reads covering that short region (~300 bases). I thought that, maybe, such high coverage could be the problem, so I tried to subsample my fastq files using 0.5 and 0.1 fractions using seqtk to see if I was right, but that didn't help - no insert was called in either of the attempts.
gatk -T HaplotypeCaller and it called both events, although it called the GA -> CG event as two separate ones. The problem with HaplotypeCaller is that it seems to find more significant events elsewhere in the genome. We do have reads mapped elsewhere in the genome, but in very small numbers, e.g. 100-200 reads at a few other regions.
I'm just curious if you have any suggestions. Also, I wasn't sure if this is the right place to hit you with the question. Let me know if you think BioStars is better for this discussion.
seqtk sample ~/lustre/raw-data/blahblah/P-C18-36_S12_L001_R1_001.fastq.gz 0.1 > P-C18-36_R1_0.1.fastq
I think seqtk does random sampling
-F) option. I set it to 10% (-F 0.1) and an insert got called nicely. So to answer the previous question, it must be at least 2000 reads that support the insert. I do need to figure out the best approach though, because I have close to 50 different samples with variable coverage... should I always run it with a 10% fraction? I did think to suggest to the researcher that future runs be sequenced at a lower depth. Anyway, this is for me to figure out. Thanks heaps for the help.
The GA -> CG event was always called, with DP=180298. The TA -> TACCTTCCGGA event got called when I set -F 0.1; however, this event is covered by more reads (DP=184919).
-F switch works. I posted a more in-depth description with some data in your google group (only just found out about it :) ) https://groups.google.com/forum/#!topic/freebayes/DJB1NYdcK7E
--haplotype-length 0 --min-alternate-count 1 --min-alternate-fraction 0 --pooled-continuous --report-monomorphic options. I'd now like to filter on allele frequency in the population (AO/RO). Is there a simple way to do this, or do I need to script a bit to pull out those values, calculate the frequency, and then filter? Thanks.
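If scripting turns out to be the way to go, a minimal sketch of the "pull out the values, calculate, filter" idea might look like this. It assumes the records carry RO and AO in the INFO column and takes the frequency to be AO / (RO + sum of AO), which is one possible reading of "AO/RO"; `alt_fractions`, `passes`, and the `min_fraction` cutoff are hypothetical names for illustration.

```python
def alt_fractions(info):
    """Return the per-alternate-allele observation fraction from an INFO
    string, computed as AO_i / (RO + sum(AO)). Returns [] when RO/AO are
    absent or non-numeric (e.g. at monomorphic sites)."""
    tags = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    try:
        ro = int(tags["RO"])
        ao = [int(x) for x in tags["AO"].split(",")]
    except (KeyError, ValueError):
        return []
    total = ro + sum(ao)
    return [a / total if total else 0.0 for a in ao]


def passes(vcf_line, min_fraction=0.05):
    """Keep a record if any alternate allele reaches min_fraction
    (hypothetical cutoff)."""
    info = vcf_line.rstrip("\n").split("\t")[7]
    return any(f >= min_fraction for f in alt_fractions(info))


if __name__ == "__main__":
    record = "chr1\t5\t.\tA\tG\t10\t.\tDP=100;RO=90;AO=10\tGT\t0/1"
    print(alt_fractions(record.split("\t")[7]), passes(record))
```

Header lines (starting with `#`) would need to be passed through untouched before applying `passes` to the data lines.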
src/Makefile to point to 4.8.1. Some submodules didn't compile this way, so I added the location of the 4.8.1 gcc/g++ binaries to the front of my PATH and this worked. I'd recommend trying this if you are running into gcc/g++ version issues, which are related to the -std=c++11/-std=c++0x errors.