    Laurent Francioli
    @lfrancioli
    This is the stack trace -- too long for gitter as usual :)
    gtiao
    @gtiao
    Pipeline is
    ht = ht.persist()
    # Rank rows by score, SNVs first, then indels
    rank_ht = ht.select(is_indel=hl.is_indel(ht.alleles[0], ht.alleles[1]), score=ht.CNN_1D_Score)
    rank_ht = rank_ht.order_by(rank_ht.is_indel, rank_ht.score)
    rank_ht = rank_ht.add_index().persist()
    # Count SNVs so that indel indices can be offset to start from 0
    n_snvs = rank_ht.aggregate(hl.agg.count_where(~rank_ht.is_indel))
    rank_ht = rank_ht.annotate(idx=hl.cond(rank_ht.is_indel, rank_ht.idx - n_snvs, rank_ht.idx))
    Daniel King
    @danking
    looking
    Konrad Karczewski
    @konradjk
    how many partitions did you end up with?
    gtiao
    @gtiao
    1000
    Daniel King
    @danking
    the issue seems to be that your partitions are too large to shuffle
    gtiao
    @gtiao
    how can you tell?
    Daniel King
    @danking
    in particular they are greater than 2GB
    the stack trace is coming from the shuffling part of Spark’s code base
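    A hedged sketch of the workaround this points at, assuming Hail 0.2's Table.repartition (the partition count here is illustrative, not from the thread):
    import hail as hl

    ht = hl.read_table('gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ht')
    # Spark cannot move shuffle blocks larger than 2 GB, so spreading the same
    # rows over more partitions keeps each block under the limit.
    ht = ht.repartition(2000, shuffle=True)  # shuffle=True allows increasing the count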
    gtiao
    @gtiao
    All right, I’ll crank up # partitions and see if that helps
    Daniel King
    @danking
    :ok_hand:
    gtiao
    @gtiao
    How do I tell how large my partition sizes are currently?
    Konrad Karczewski
    @konradjk
    wait yeah this isn't that much data
    gtiao
    @gtiao
    You know, so I can do the math
    Konrad Karczewski
    @konradjk
    sites only
    gsutil du -s -h gs://path/to
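    Hail can also report this directly; a sketch combining the gsutil total with Table.n_partitions (a real Table method; the 6.2 GB figure is the one quoted below):
    import hail as hl

    ht = hl.read_table('gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ht')
    total_gb = 6.2                           # from gsutil du -s -h
    print(ht.n_partitions())                 # partitions Hail actually created
    print(total_gb * 1024 / ht.n_partitions(), 'MB per partition on average')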
    Daniel King
    @danking
    certainly possible we have an implementation bug
    what’s the schema?
    gtiao
    @gtiao
    Contig    Pos    Ref    Alt    CNN_1D_Score
    ht = hl.import_table('gs://gnomad/variant_qc/temp/friedman_cnn_scores.tsv.gz', force_bgz=True, min_partitions=1000, impute=True)
    ht.write('gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ht', overwrite=True)
    I’m getting 6.2 GB for the whole thing, so each partition should not be >2GB
    Daniel King
    @danking
    ok
    I’ll investigate
    Laurent Francioli
    @lfrancioli
    wait, are you sure there are 1k partitions?
    Daniel King
    @danking
    is there anything in between the import_table & write and the block above?
    Laurent Francioli
    @lfrancioli
    Laurent:hail2 laurent$ gsutil ls gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ht/rows/parts/
    gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ht/rows/parts/
    gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ht/rows/parts/part-0-2-0-0-4c1e0ba4-7fc3-be60-93ab-7160eaea2afa
    gtiao
    @gtiao
    import hail as hl

    # add_variant_type and add_rank are helpers defined elsewhere in this script
    def main():
        hl.init(log='/variantqc.log')
    
        ht = hl.import_table('gs://gnomad/variant_qc/temp/friedman_cnn_scores.tsv.gz', force_bgz=True, min_partitions=1000, impute=True)
        ht.write('gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ht', overwrite=True)
    
        ht = hl.read_table('gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ht')
        ht = ht.annotate(alt_alleles=ht.Alt.split(','))  # This transforms to a list
        ht = ht.explode('alt_alleles')
        ht = ht.annotate(locus=hl.locus(hl.str(ht.Contig), ht.Pos))
    
        # Apply minrep
        ht = ht.annotate(alleles=hl.min_rep(ht.locus, [ht.Ref, ht.alt_alleles])[1])
    
        # Add variant_type
        ht = ht.annotate(vartype=add_variant_type(ht.alleles))
        ht = ht.transmute(variant_type=ht.vartype.variant_type, n_alt_alleles=ht.vartype.n_alt_alleles)
    
        # Add rank
        print('Adding rank...')
        ht = add_rank(ht)
        ht.key_by('locus', 'alleles').write('gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ranked.ht', overwrite=True)
    Laurent Francioli
    @lfrancioli
    Not as familiar with the new Hail data format, but shouldn't there be 1k parts in there?
    gtiao
    @gtiao
    Where add_rank() is the function that contains the order_by code
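    Reading the two messages together, add_rank presumably wraps the snippet from the top of the thread; a sketch under that assumption (the join of idx back onto the keyed table isn't shown anywhere here, so the return is a guess):
    def add_rank(ht):
        # body as pasted earlier in this thread
        ht = ht.persist()
        rank_ht = ht.select(is_indel=hl.is_indel(ht.alleles[0], ht.alleles[1]),
                            score=ht.CNN_1D_Score)
        rank_ht = rank_ht.order_by(rank_ht.is_indel, rank_ht.score)
        rank_ht = rank_ht.add_index().persist()
        n_snvs = rank_ht.aggregate(hl.agg.count_where(~rank_ht.is_indel))
        rank_ht = rank_ht.annotate(idx=hl.cond(rank_ht.is_indel,
                                               rank_ht.idx - n_snvs,
                                               rank_ht.idx))
        return rank_ht  # guess: the real helper likely joins idx back onto ht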
    Daniel King
    @danking
    uh
    Konrad Karczewski
    @konradjk
    yeah, might need to set the min_block_size
    Daniel King
    @danking
    there should be more than one anyway
    Konrad Karczewski
    @konradjk
    hl.init(min_block_size=0)
    @lfrancioli is correct
    Laurent Francioli
    @lfrancioli
    Isn't the default min_block_size 1 MB?
    from the doc that's what I see
    but maybe the doc isn't accurate :)
    Daniel King
    @danking
    it’s definitely 1 and definitely measured in MB
    I just checked
    gtiao
    @gtiao
    Why does min block size matter here?
    Laurent Francioli
    @lfrancioli
    because the number of partitions depends on min_block_size (minimum partition size) and min_partitions (minimum number of partitions)
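    Back-of-the-envelope with the numbers from this thread (a sketch, not Hail's exact splitting logic):
    # With min_block_size=1 (MB) and min_partitions=1000 on a ~6.2 GB input,
    # each partition comes out around 6.3 MB, well above the 1 MB floor,
    # so min_partitions should be the binding constraint here.
    file_size_mb = 6.2 * 1024        # ~6349 MB total
    min_partitions = 1000
    min_block_size_mb = 1
    per_partition_mb = file_size_mb / min_partitions   # ~6.3 MB
    assert per_partition_mb >= min_block_size_mb       # the 1 MB floor is not binding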
    Konrad Karczewski
    @konradjk
    oh didn't realize it was 1 mb, i thought it was larger
    but even then yeah that doesn't explain it
    is min_partitions in import_table not working?
    Laurent Francioli
    @lfrancioli
    that could explain it :)
    Daniel King
    @danking
    @gtiao it looks like you might be loading all the data into one partition?
    gtiao
    @gtiao
    I thought that was what I was trying to avoid by ht = hl.import_table('gs://gnomad/variant_qc/temp/friedman_cnn_scores.tsv.gz', force_bgz=True, min_partitions=1000, impute=True)
    Daniel King
    @danking
    min_block_size controls how small input file blocks can be
    I also think that would avoid it.
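    A minimal sanity check along those lines, using only the call from this thread plus Table.n_partitions (a real Hail method; the expected count is an assumption from the min_partitions argument):
    import hail as hl

    ht = hl.import_table('gs://gnomad/variant_qc/temp/friedman_cnn_scores.tsv.gz',
                         force_bgz=True, min_partitions=1000, impute=True)
    # min_partitions is only a lower-bound request; check what actually happened
    # before writing, since the write bakes the partitioning into the table.
    print(ht.n_partitions())  # expect ~1000; 1 means the input was never split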