    Patrik Medstrand
    Hello - I have a question regarding an output file from V-pipe. What do the headings in the snvs.csv file stand for? Example of the heading and first line:
    Chromosome Pos Ref Var Frq1 Frq2 Frq3 Pst1 Pst2 Pst3 Fvar Rvar Ftot Rtot Pval Qval
    NC_045512.2 241 T C 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 567 649 567 649 1 1
    Specifically, what are Frq1, Frq2, etc. and Pst1, Pst2, etc.? Can somebody help? Thanks
    Mahmut Uludag

    Patrik, I'm an outsider to the project who wants to learn. I have noticed that there is information on similar columns in the VCF files generated by V-pipe. Please check your V-pipe output folders for VCF files and look at the INFO lines at the beginning of those files.

    It looks like some of the VCF and CSV files are generated by this code: https://github.com/cbg-ethz/shorah/blob/master/src/shorah/shorah_snv.py. Please see lines 440 to 472, where the VCF INFO lines are processed.


    Hi, for the columns of the CSV file:


    The trivial part:

    • Chromosome: name of the reference sequence used
    • Pos: position on the reference sequence
    • Ref: what the reference sequence had
    • Var: variant found by shorah

    The specific ShoRAH scoring part:

    ...1/2/3 :
    ShoRAH in shotgun mode splits the genome into overlapping regions (see parameters -w window size and -s shifts between windows). For each variant called during the SNV phase, shorah looks at three overlapping windows from the diri_sampler phase to confirm the variant. At least two windows are required to call a variant.

    Frq...: frequency at which the variant got observed in that window.
    (fractional number in the 0.0 - 1.0 range)
    Pst...: posterior probability of the variant in that window.
    ( == equivalent to the haplotype's posterior in "support/")
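
As a minimal sketch of how the per-window columns could be read, the snippet below parses the example row from the question and counts the windows that report a numeric frequency and posterior. It assumes whitespace-separated fields as pasted in the question (the real file may well use commas), and treating non-numeric entries as "no value for this window" is an assumption rather than documented ShoRAH behaviour. `parse_row` and `supporting_windows` are hypothetical helper names.

```python
HEADER = "Chromosome Pos Ref Var Frq1 Frq2 Frq3 Pst1 Pst2 Pst3 Fvar Rvar Ftot Rtot Pval Qval"
ROW = "NC_045512.2 241 T C 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 567 649 567 649 1 1"


def parse_row(header, row):
    """Pair each column name with its value (whitespace-separated: an assumption)."""
    return dict(zip(header.split(), row.split()))


def _as_float(text):
    """Return the value as a float, or None if the field is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def supporting_windows(record):
    """Collect (frequency, posterior) pairs for windows with numeric values."""
    pairs = []
    for i in (1, 2, 3):
        frq = _as_float(record[f"Frq{i}"])
        pst = _as_float(record[f"Pst{i}"])
        if frq is not None and pst is not None:
            pairs.append((frq, pst))
    return pairs


record = parse_row(HEADER, ROW)
windows = supporting_windows(record)
print(len(windows), "windows report", record["Ref"], "->", record["Var"], "at", record["Pos"])
```

For the row in the question, all three windows carry numeric values, which is consistent with the "at least two windows" rule described above.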

    F/R...: the fil step works by comparing the proportions of
    forward- and reverse-mapped reads and looks for a bias toward one strand.
    (In paired-end, double-stranded sequencing, a real SNV should be seen in roughly similar proportions in both directions, whereas a sequencing error could appear in one direction but be completely missing from the reverse direction.)

    ...var: number of occurrences of the variant in each respective direction
    ...tot: total number of reads at that position.

    Pval: is the P-value for the question "is there a strand bias?"

    • in analyses performed with paired-end, double-stranded sequencing, the p-value should be as high as possible (the null hypothesis is retained: there is no bias, and most of the observed difference could be due to chance)
    • with special sequencing protocols where only one strand is observed, the p-value should be low (the null hypothesis is rejected: there is a bias, and we mostly observe the variants on one single strand, the one targeted by the special protocol). This is based on experience with some weird protocols used at the Geneva Genomics Laboratory.
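
The thread does not say which statistical test the fil step applies, so the sketch below swaps in a two-sided Fisher's exact test purely to illustrate what a strand-bias check on the four count columns can look like. The function name and the 2x2 layout (variant vs. non-variant reads, forward vs. reverse) are assumptions, not ShoRAH's implementation.

```python
from math import exp, lgamma


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def strand_bias_pvalue(fvar, rvar, ftot, rtot):
    """Two-sided Fisher's exact test on variant/non-variant reads per strand.

    Stand-in test for illustration only; fil may use a different test.
    """
    a, b = fvar, ftot - fvar  # forward: variant, non-variant
    c, d = rvar, rtot - rvar  # reverse: variant, non-variant
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def log_p(x):
        # Hypergeometric log-probability of x variant reads on the forward strand.
        return _log_comb(row1, x) + _log_comb(row2, col1 - x) - _log_comb(n, col1)

    observed = log_p(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    total = 0.0
    for x in range(lowest, highest + 1):
        lp = log_p(x)
        if lp <= observed + 1e-7:  # tables at least as extreme as the observed one
            total += exp(lp)
    return min(1.0, total)


# Counts from the example row in the question: Fvar, Rvar, Ftot, Rtot.
print(strand_bias_pvalue(567, 649, 567, 649))
```

In the example row every read carries the variant on both strands, so only one table is possible and this stand-in test returns 1.0. That it agrees with the reported Pval of 1 does not confirm that fil uses the same test.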

    Qval: (explanations by Osvaldo)
    If the forward/reverse read ratio for a variant deviates "too much" from the overall ratio, then the p-value is low, i.e. there is strand bias and you cannot trust the variant call.
    Nevertheless, since we are running multiple tests (one for each variant call), the chance of false positive "triggers" (deviation from the expected ratio -> low p-value -> rejection of the variant call even though it should be accepted) is fairly high. To mitigate this, a correction for multiple testing is applied, specifically Benjamini-Hochberg.
    More here https://xkcd.com/882/
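
A minimal sketch of the Benjamini-Hochberg adjustment in plain Python, to show how a list of p-values turns into q-values. It illustrates the general procedure, not ShoRAH's own code; `benjamini_hochberg` is a hypothetical helper name and the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Made-up p-values, one per variant call.
example_pvalues = [0.01, 0.04, 0.03, 0.5]
example_qvalues = benjamini_hochberg(example_pvalues)
print(example_qvalues)
```

Each q-value is at least as large as its p-value, so a variant that only just passes on its raw p-value may no longer pass after the correction.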

    (Sorry for the slow answer, somehow I had missed the e-mail alert about @medstrand_gitlab 's post)

    In the VCF file:

    The QUAL score inside the standard VCF file is currently produced by combining the posterior probabilities of the 2 or 3 windows covering an SNV. (Currently the strand bias test has no influence on QUAL.)
    As mentioned by @uludag, the individual columns of the CSV are also made available in the INFO field of the VCF.
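
To look up what each INFO key means, one option is to read the `##INFO` header lines of the VCF. The sketch below relies only on the standard VCF header syntax; the sample lines and their descriptions are made up for illustration, and `info_descriptions` is a hypothetical helper. The real IDs and descriptions should be taken from the file V-pipe actually wrote.

```python
import re

# Made-up header lines for illustration only; not copied from a real V-pipe VCF.
SAMPLE_VCF_HEADER = [
    '##fileformat=VCFv4.2',
    '##INFO=<ID=Fvar,Number=1,Type=Integer,Description="Example description (made up)">',
    '##INFO=<ID=Pval,Number=1,Type=Float,Description="Another made-up description">',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO',
]

# Standard VCF meta-information syntax: ##INFO=<ID=...,...,Description="...">
INFO_LINE = re.compile(r'^##INFO=<ID=([^,]+),.*Description="([^"]*)"')


def info_descriptions(lines):
    """Map each INFO ID to its description, reading only the header lines."""
    found = {}
    for line in lines:
        if not line.startswith("#"):
            break  # the header ends where the data records start
        match = INFO_LINE.match(line)
        if match:
            found[match.group(1)] = match.group(2)
    return found


for key, description in info_descriptions(SAMPLE_VCF_HEADER).items():
    print(key, "=", description)
```

The same function can be given an open file handle instead of the sample list, since it only iterates over lines.
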
    Patrik Medstrand
    Thanks for your responses and clarification. Much appreciated!
    I hope that it helps.