    Konrad Karczewski
    @konradjk
    cc @patrick-schultz
    this is the same error i got yesterday
    great that you got a reproducible example - mine was deep in the bowels of sample qc so it was hard to get
    actually, what are the describe/keys of snv_variants?
    and is that one ordered on disk? (what's in ....ht/rows/metadata.json.gz - does it say OrderedRVDSpec or UnpartitionedRVDSpec)
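    A quick way to check that, assuming a hypothetical table at gs://bucket/my_table.ht:
    gsutil cat gs://bucket/my_table.ht/rows/metadata.json.gz | gzip -dc
    The "name" field of the resulting JSON reads OrderedRVDSpec or UnpartitionedRVDSpec.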
    Laurent Francioli
    @lfrancioli
    It is OrderedRVDSpec
    And this is the description:
    ----------------------------------------
    Global fields:
        None
    ----------------------------------------
    Row fields:
        'locus': locus<GRCh38> 
        'alleles': array<str> 
        'n_discordant': int64 
        'concordance': array<array<int64>> 
        'dataset': str 
        'v2_callstats': struct {
            AC: array<int32>, 
            AF: array<float64>, 
            AN: int32, 
            homozygote_count: array<int32>
        } 
        'v2_was_split': bool 
        'v3_callstats': struct {
            AC: array<int32>, 
            AF: array<float64>, 
            AN: int32, 
            homozygote_count: array<int32>
        } 
        'v3_was_split': bool 
    ----------------------------------------
    Key: ['locus', 'alleles']
    ----------------------------------------
    Not sure if this helps, but annotating it with another Table that I just created doesn't cause the problem.
    Annotating with either of the two tables I want causes it
    shamsudheen kv
    @kvshams
    is there a hail.VariantDataset.filter_variants_table() equivalent in hail 2.0?
    Daniel King
    @danking
    mt.filter_rows(hl.is_defined(t[mt.row_key]))—basically: join the table on the row key of the matrix table and keep rows where the table has that key
    @kvshams ^
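    Spelled out, that pattern looks like the sketch below (the paths and names here are hypothetical):
    import hail as hl

    # a matrix table keyed by ['locus', 'alleles']
    mt = hl.read_matrix_table('gs://bucket/dataset.mt')
    # a table keyed the same way, listing the variants to keep
    ht = hl.read_table('gs://bucket/keep_variants.ht')

    # ht[mt.row_key] joins the table against each row's key; the join result
    # is missing wherever the table has no matching key, so is_defined keeps
    # only the rows present in the table -- the old filter_variants_table behavior
    mt = mt.filter_rows(hl.is_defined(ht[mt.row_key]))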
    shamsudheen kv
    @kvshams
    thanks :+1:
    gtiao
    @gtiao
    Hi, guys — I’m running an order_by on a genome sites-only Table that keeps failing with java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    What can I do to get it to work?
    Daniel King
    @danking
    what's the first several lines of the stack trace after that message?
    gtiao
    @gtiao
    This is the stack trace -- too long for gitter as usual :)
    Pipeline is
    ht = ht.persist()
    rank_ht = ht.select(is_indel=hl.is_indel(ht.alleles[0], ht.alleles[1]), score=ht.CNN_1D_Score)
    # sort SNVs before indels (False < True), then by score within each class
    rank_ht = rank_ht.order_by(rank_ht.is_indel, rank_ht.score)
    rank_ht = rank_ht.add_index().persist()
    n_snvs = rank_ht.aggregate(hl.agg.count_where(~rank_ht.is_indel))
    # SNVs occupy indices [0, n_snvs), so shifting indels down by n_snvs
    # gives each class its own 0-based rank
    rank_ht = rank_ht.annotate(idx=hl.cond(rank_ht.is_indel, rank_ht.idx - n_snvs, rank_ht.idx))
    Daniel King
    @danking
    looking
    Konrad Karczewski
    @konradjk
    how many partitions did you end up with?
    gtiao
    @gtiao
    1000
    Daniel King
    @danking
    the issue seems to be that your partitions are too large to shuffle
    gtiao
    @gtiao
    how can you tell?
    Daniel King
    @danking
    in particular they are greater than 2GB
    the stack trace is coming from the shuffling part of Spark’s code base
    gtiao
    @gtiao
    All right, I’ll crank up # partitions and see if that helps
    Daniel King
    @danking
    :ok_hand:
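    For context: Spark reads each shuffle block into a single JVM buffer, and those are capped at Integer.MAX_VALUE bytes (about 2 GB), hence the error. Cranking up the partition count might look like the following; the target of 5000 is a made-up number:
    # more partitions -> smaller shuffle blocks; pick a count that keeps each well under 2 GB
    ht = ht.repartition(5000)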
    gtiao
    @gtiao
    How do I tell how large my partition sizes are currently?
    Konrad Karczewski
    @konradjk
    wait yeah this isn't that much data
    gtiao
    @gtiao
    You know, so I can do the math
    Konrad Karczewski
    @konradjk
    sites only
    gsutil du -s -h gs://path/to
    Daniel King
    @danking
    certainly possible we have an implementation bug
    what’s the schema?
    gtiao
    @gtiao
    Contig    Pos    Ref    Alt    CNN_1D_Score
    ht = hl.import_table('gs://gnomad/variant_qc/temp/friedman_cnn_scores.tsv.gz', force_bgz=True, min_partitions=1000, impute=True)
    ht.write('gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ht', overwrite=True)
    I’m getting 6.2 GB for the whole thing, so each partition should not be >2GB
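    The arithmetic, assuming the partitions are roughly even:
    total_mb = 6.2 * 1024    # 6.2 GB on disk
    print(total_mb / 1000)   # ~6.3 MB per partition, nowhere near the 2 GB shuffle limit
    So if a partition really is exceeding 2 GB, the data likely isn't actually split 1000 ways.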
    Daniel King
    @danking
    ok
    I’ll investigate
    Laurent Francioli
    @lfrancioli
    wait, are you sure there are 1k partitions?
    Daniel King
    @danking
    is there anything in between the import table & write and the block above?
    Laurent Francioli
    @lfrancioli
    Laurent:hail2 laurent$ gsutil ls gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ht/rows/parts/
    gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ht/rows/parts/
    gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ht/rows/parts/part-0-2-0-0-4c1e0ba4-7fc3-be60-93ab-7160eaea2afa
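    That listing appears to show a single part file, i.e. the table was written as one ~6 GB partition despite min_partitions=1000, which would explain the shuffle failure. A direct way to confirm:
    ht = hl.read_table('gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ht')
    print(ht.n_partitions())  # would print 1 here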
    gtiao
    @gtiao
    def main():
        hl.init(log='/variantqc.log')
    
        ht = hl.import_table('gs://gnomad/variant_qc/temp/friedman_cnn_scores.tsv.gz', force_bgz=True, min_partitions=1000, impute=True)
        ht.write('gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ht', overwrite=True)
    
        ht = hl.read_table('gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ht')
        ht = ht.annotate(alt_alleles=ht.Alt.split(','))  # This transforms to a list
        ht = ht.explode('alt_alleles')
        ht = ht.annotate(locus=hl.locus(hl.str(ht.Contig), ht.Pos))
    
        # Apply minrep
        ht = ht.annotate(alleles=hl.min_rep(ht.locus, [ht.Ref, ht.alt_alleles])[1])
    
        # Add variant_type
        ht = ht.annotate(vartype=add_variant_type(ht.alleles))
        ht = ht.transmute(variant_type=ht.vartype.variant_type, n_alt_alleles=ht.vartype.n_alt_alleles)
    
        # Add rank
        print('Adding rank...')
        ht = add_rank(ht)
        ht.key_by('locus', 'alleles').write('gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ranked.ht', overwrite=True)
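    The split-and-explode step above turns one row per multi-allelic site into one row per alternate allele. A minimal self-contained illustration with made-up data:
    import hail as hl

    # hypothetical two-site table; the second site is multi-allelic
    t = hl.Table.parallelize(
        [{'Contig': '1', 'Pos': 100, 'Ref': 'A', 'Alt': 'T'},
         {'Contig': '1', 'Pos': 200, 'Ref': 'G', 'Alt': 'C,T'}],
        hl.tstruct(Contig=hl.tstr, Pos=hl.tint32, Ref=hl.tstr, Alt=hl.tstr))

    t = t.annotate(alt_alleles=t.Alt.split(','))  # 'C,T' -> ['C', 'T']
    t = t.explode('alt_alleles')                  # one row per alternate allele
    t.show()  # the site at 1:200 now appears twice, once per alt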
    Laurent Francioli
    @lfrancioli
    Not as familiar with the new Hail data format, but shouldn't there be 1k parts in there?
    gtiao
    @gtiao
    Where add_rank() is the function that contains the order_by code
    Daniel King
    @danking
    uh
    Konrad Karczewski
    @konradjk
    yeah, might need to set the min_block_size
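    min_block_size is a parameter to hl.init, given in megabytes. A sketch of what setting it might look like, not a verified fix:
    # allow file splits smaller than the default minimum, so min_partitions can take effect
    hl.init(log='/variantqc.log', min_block_size=0)
    ht = hl.import_table('gs://gnomad/variant_qc/temp/friedman_cnn_scores.tsv.gz',
                         force_bgz=True, min_partitions=1000, impute=True)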