    gtiao
    @gtiao
    You know, so I can do the math
    Konrad Karczewski
    @konradjk
    sites only
    gsutil du -s -h gs://path/to
    Daniel King
    @danking
    certainly possible we have an implementation bug
    what’s the schema?
    gtiao
    @gtiao
    Contig    Pos    Ref    Alt    CNN_1D_Score
    ht = hl.import_table('gs://gnomad/variant_qc/temp/friedman_cnn_scores.tsv.gz', force_bgz=True, min_partitions=1000, impute=True)
    ht.write('gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ht', overwrite=True)
    I’m getting 6.2 GB for the whole thing, so each partition should not be >2GB
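    (For scale: 6.2 GB split across 1,000 partitions works out to roughly 6 MB per partition, far below a 2 GB limit.)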
    Daniel King
    @danking
    ok
    I’ll investigate
    Laurent Francioli
    @lfrancioli
    wait, are you sure there are 1k partitions?
    Daniel King
    @danking
    is there anything in between the import_table & write and the block above?
    Laurent Francioli
    @lfrancioli
    Laurent:hail2 laurent$ gsutil ls gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ht/rows/parts/
    gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ht/rows/parts/
    gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ht/rows/parts/part-0-2-0-0-4c1e0ba4-7fc3-be60-93ab-7160eaea2afa
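    The partition count can also be checked directly in Hail; a minimal sketch, reusing the path above:
    import hail as hl
    ht = hl.read_table('gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ht')
    print(ht.n_partitions())  # a single part file in the listing above implies this prints 1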
    gtiao
    @gtiao
    import hail as hl

    def main():
        hl.init(log='/variantqc.log')

        ht = hl.import_table('gs://gnomad/variant_qc/temp/friedman_cnn_scores.tsv.gz', force_bgz=True, min_partitions=1000, impute=True)
        ht.write('gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ht', overwrite=True)

        ht = hl.read_table('gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ht')
        ht = ht.annotate(alt_alleles=ht.Alt.split(','))  # This transforms the comma-separated Alt string into a list
        ht = ht.explode('alt_alleles')  # One row per alt allele
        ht = ht.annotate(locus=hl.locus(hl.str(ht.Contig), ht.Pos))

        # Apply minrep to normalize each (ref, alt) pair
        ht = ht.annotate(alleles=hl.min_rep(ht.locus, [ht.Ref, ht.alt_alleles])[1])

        # Add variant_type (add_variant_type is a user-defined helper)
        ht = ht.annotate(vartype=add_variant_type(ht.alleles))
        ht = ht.transmute(variant_type=ht.vartype.variant_type, n_alt_alleles=ht.vartype.n_alt_alleles)

        # Add rank (add_rank is the user-defined function containing the order_by code)
        print('Adding rank...')
        ht = add_rank(ht)
        ht.key_by('locus', 'alleles').write('gs://gnomad/variant_qc/temp/friedman_cnn_scores.no_chr17.ranked.ht', overwrite=True)
    Laurent Francioli
    @lfrancioli
    Not as familiar with the new Hail data format, but shouldn't there be 1k parts in there?
    gtiao
    @gtiao
    Where add_rank() is the function that contains the order_by code
    Daniel King
    @danking
    uh
    Konrad Karczewski
    @konradjk
    yeah, might need to set the min_block_size
    Daniel King
    @danking
    there should be more than one anyway
    Konrad Karczewski
    @konradjk
    hl.init(min_block_size=0)
    @lfrancioli is correct
    Laurent Francioli
    @lfrancioli
    Isn't the default min_block_size 1 MB?
    from the doc that's what I see
    but maybe the doc isn't accurate :)
    Daniel King
    @danking
    it’s definitely 1 and definitely measured in MB
    I just checked
    gtiao
    @gtiao
    Why does min_block_size matter here?
    Laurent Francioli
    @lfrancioli
    because the number of partitions depends on min_block_size (minimum partition size) and min_partitions (minimum number of partitions)
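    To make the interplay concrete, a minimal sketch (the path and values are illustrative):
    import hail as hl
    hl.init(min_block_size=0)  # minimum size of an input block, in MB
    ht = hl.import_table('gs://path/to/file.tsv.bgz',
                         min_partitions=1000,  # lower bound on the partition count
                         impute=True)
    print(ht.n_partitions())  # expected to be at least 1000 if the hint is honored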
    Konrad Karczewski
    @konradjk
    oh, didn't realize it was 1 MB, I thought it was larger
    but even then, yeah, that doesn't explain it
    is min_partitions in import_table not working?
    Laurent Francioli
    @lfrancioli
    that could explain it :)
    Daniel King
    @danking
    @gtiao it looks like you might be loading all the data into one partition?
    gtiao
    @gtiao
    I thought that was what I was trying to avoid with ht = hl.import_table('gs://gnomad/variant_qc/temp/friedman_cnn_scores.tsv.gz', force_bgz=True, min_partitions=1000, impute=True)
    Daniel King
    @danking
    min_block_size controls how small input file blocks can be
    I also think that would avoid it.
    gtiao
    @gtiao
    Maybe force_bgz doesn’t work with min_partitions?
    It is actually bgzipped
    Daniel King
    @danking
    lemme spin up a little cluster to poke at that file
    Daniel King
    @danking
    blah. we need to make 0.2 the default; it's such a long turnaround time if I forget
    Daniel King
    @danking
    so it's definitely being loaded in one partition
    which is obviously bad and wrong, and I’ll try to figure out why
    gtiao
    @gtiao
    OK, cool — thanks for looking into it!
    Daniel King
    @danking
    @gtiao you’re on latest master right?
    Daniel King
    @danking
    @gtiao yeah it’s force_bgz being broken somehow, if you can rename the file that will bypass the issue for now
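    A sketch of that workaround, assuming the file is renamed with a .bgz extension (which Hail treats as block-gzipped without needing force_bgz); the renamed path is hypothetical:
    # e.g. gsutil mv gs://gnomad/variant_qc/temp/friedman_cnn_scores.tsv.gz \
    #               gs://gnomad/variant_qc/temp/friedman_cnn_scores.tsv.bgz
    import hail as hl
    ht = hl.import_table('gs://gnomad/variant_qc/temp/friedman_cnn_scores.tsv.bgz',
                         min_partitions=1000, impute=True)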
    gtiao
    @gtiao
    Great — I will do that. I’ve been using a Konrad jar (gs://konradk/jars/hail-6d4d50458.jar) but I don’t recall what the specific issue was that we were trying to address with that
    Konrad Karczewski
    @konradjk
    Am I to presume that if I see this in my file:
     'gs://gnomad/annotations/hail-0.2/ht/exomes/gnomad.exomes.family_stats.ht/rows/parts/part-04249-15-4249-0-0e67bae1-c1d2-5e25-aad1-eb8c419bbdbe',
     'gs://gnomad/annotations/hail-0.2/ht/exomes/gnomad.exomes.family_stats.ht/rows/parts/part-04249-15-4249-1-8e47c4bd-6382-0449-a9c9-7f3d29ea1511',
    that the latter one is the correct one?
    klaricch
    @klaricch
    any thoughts on how to follow up on the error below? ld_prune had worked on an exome matrix table, but then I joined it with data from an array matrix table, lost a lot of variants, and kept only GT as an entry field. Not sure if I need to skip that join.
    mm_test = hl.ld_prune(mm.GT, r2=0.1)
    FatalError: ArrayIndexOutOfBoundsException: 6
    
    Java stack trace:
    java.lang.ArrayIndexOutOfBoundsException: 6
        at is.hail.methods.LocalLDPrune$.apply(LocalLDPrune.scala:294)
        at is.hail.methods.LocalLDPrune.apply(LocalLDPrune.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)
    
    Hail version: devel-15eaf7588401
    Error summary: ArrayIndexOutOfBoundsException: 6
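    For context, a minimal sketch of the pipeline described above; the paths and the choice of union_cols as the join are hypothetical reconstructions:
    import hail as hl
    exomes = hl.read_matrix_table('gs://path/to/exomes.mt')
    array = hl.read_matrix_table('gs://path/to/array.mt')
    # Join the two datasets, keeping only GT as an entry field
    mm = exomes.select_entries('GT').union_cols(array.select_entries('GT'))
    # LD-prune at r2 = 0.1; this is the call that raised the exception above
    mm_test = hl.ld_prune(mm.GT, r2=0.1)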