    Alistair Miles
    @alimanfoo
    pysamstats 0.24.2 is out, https://github.com/alimanfoo/pysamstats. This is a bug fix release resolving a couple of fairly serious problems. Upgrading is highly recommended.
    Alistair Miles
    @alimanfoo
    scikit-allel 0.20.3 is out, release notes here http://scikit-allel.readthedocs.org/en/latest/release.html. This is a bug fix release; in particular, it fixes a nasty bug in the count_alleles() method on genotype and haplotype array classes. Upgrading is recommended.
    Alistair Miles
    @alimanfoo
    The cggh/biipy:v1.6.0 docker image is out, see https://github.com/cggh/biipy. It's what we use for interactive analysis of genome variation data from the Ag1000G project.
    Nick Harding
    @hardingnj
    The cggh/biipy:v2.1.0 docker image is out, see https://github.com/cggh/biipy. It's what we use for interactive analysis of genome variation data from the Ag1000G project. This release leans on conda/bioconda rather than pip. Build times were becoming prohibitive, so we split the image into biipy and biipy_base. Expect further incremental updates as we migrate the remainder of the packages to conda recipes.
    Deren Eaton
    @dereneaton
    @alimanfoo It seems that the allel.stats.decomposition.pca function currently does not allow missing data in the genotypes array; it raises a ValueError due to NaNs. If I build the geno array using geno = genotypes.to_n_alt() it works, but with geno = genotypes.to_n_alt(fill=-1) it does not. The problem with the first approach is that the default fill=0 makes missing data appear homozygous for the reference allele, which will greatly bias the results. Any ideas on how to work around this? Thanks!
    Deren Eaton
    @dereneaton
    To summarize more clearly:
    geno
    # array([[0, 0, 0, ..., 1, 0, 0],
    #       [0, 0, 0, ..., 0, 2, 2],
    #       [0, 0, 0, ..., 0, 2, 2],
    #     ..., 
    #       [0, 0, 0, ..., 0, 0, 0],
    #       [0, 0, 0, ..., 0, 2, 2],
    #       [0, 0, 0, ..., 0, 0, 0]], dtype=int8)
    
    np.any(np.isnan(geno))
    ## False
    np.any(np.isinf(geno))
    ## False
    
    allel.pca(geno, 5)
    ## ... ValueError: array must not contain infs or NaNs
    Alistair Miles
    @alimanfoo
    @dereneaton apologies for the slow response. I believe the ValueError in the example above occurs because some variants are invariant, i.e., all individuals have the same genotype value (e.g., 0 when coded as the number of alt alleles). This should go away if the genotype array is first subsetted to contain only segregating variants. However, this doesn't address your first question regarding how to cope with missing calls. I don't know of any way to work around this, apart from subsetting variants to include only those with very low levels of missingness, to minimise reference bias. The original Patterson paper talks about this briefly IIRC, but has no workaround either, other than to run a PCA on missingness to see if there is any systematic signal being introduced. I've raised an issue: cggh/scikit-allel#143
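    A minimal sketch of the subsetting described above, written in plain NumPy so it does not depend on any particular scikit-allel version. It assumes geno has shape (n_variants, n_samples), is coded as the number of alt alleles, and uses a negative value (e.g. from to_n_alt(fill=-1)) for missing calls. The names keep_variants, patterson_scale and max_missing are hypothetical, and patterson_scale is only an illustration of Patterson-style scaling, which is assumed (not confirmed here) to be where the NaNs come from.

```python
import numpy as np

def keep_variants(gn, max_missing=0.05):
    """Boolean mask of variants (rows) that are segregating among the
    called genotypes and have a missing-call rate <= max_missing.
    Negative values are treated as missing (assumption)."""
    gn = np.asarray(gn)
    called = gn >= 0
    missing_rate = 1.0 - called.mean(axis=1)
    # Highest and lowest called value per variant; missing calls are
    # replaced by sentinels that can never win the max or the min.
    hi = np.where(called, gn, -1).max(axis=1)
    lo = np.where(called, gn, gn.max()).min(axis=1)
    segregating = hi > lo
    return segregating & (missing_rate <= max_missing)

def patterson_scale(gn, ploidy=2):
    """Illustrative Patterson-style scaling: centre each variant and
    divide by sqrt(p * (1 - p)). Invariant variants give 0 / 0 = NaN."""
    gn = np.asarray(gn, dtype=float)
    mean = gn.mean(axis=1, keepdims=True)
    p = mean / ploidy
    with np.errstate(divide="ignore", invalid="ignore"):
        return (gn - mean) / np.sqrt(p * (1 - p))

# Toy data: rows are variants, columns are samples, -1 is a missing call.
gn = np.array([
    [0, 0, 0, 0],    # invariant (all hom ref)
    [0, 1, 2, 0],    # segregating, fully called
    [2, 2, 2, 2],    # invariant (all hom alt)
    [0, -1, 1, 2],   # segregating, 25% missing
    [1, -1, -1, 1],  # 50% missing, invariant among called
], dtype=np.int8)

keep = keep_variants(gn, max_missing=0.25)
gn_filtered = gn[keep]
```

    The filtered array still contains the -1 calls in any retained variant, so those would still need a fill value before a PCA; the max_missing threshold only limits how much reference bias that fill can introduce.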
    Deren Eaton
    @dereneaton
    @alimanfoo thanks!