These are chat archives for broadinstitute/hail

25th Aug 2016
Tim Poterba
@tpoterba
Aug 25 2016 01:53
why’d you close the PR?
cseed
@cseed
Aug 25 2016 02:21
Mistake. Fixed.
Tim Poterba
@tpoterba
Aug 25 2016 17:04
@cseed some worrying benchmarks before and after rebasing the sorting fixes you pushed yesterday:
hail: info: running: importvcf profile225.vcf.bgz
[Stage 0:=====================================================>   (15 + 1) / 16]
hail: info: Ordering unsorted dataset with network shuffle
hail: info: running: splitmulti
hail: info: running: variantqc
hail: info: running: count
[Stage 3:=================================================>       (14 + 2) / 16]
hail: info: count:
  nSamples             2,535
  nVariants          236,734
hail: info: while importing:
    hdfs://dataflow01.broadinstitute.org:8020/user/tpoterba/profile225.vcf.bgz  import clean
hail: info: timing:
  importvcf: 15.752s
  splitmulti: 10.496ms
  variantqc: 39.452ms
  count: 1m16.4s
hail: info: running: importvcf profile225.vcf.bgz
[Stage 0:=====================================================>   (15 + 1) / 16]
hail: info: Coerced sorted dataset
hail: info: running: splitmulti
hail: info: running: variantqc
hail: info: running: count
[Stage 1:=====================================================>   (15 + 1) / 16]
hail: info: count:
  nSamples             2,535
  nVariants          236,734
hail: info: while importing:
    hdfs://dataflow01.broadinstitute.org:8020/user/tpoterba/profile225.vcf.bgz  import clean
hail: info: timing:
  importvcf: 15.956s
  splitmulti: 10.252ms
  variantqc: 39.146ms
  count: 2m3.4s
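Note: to put numbers on the two timing blocks above, here is a small Python sketch that parses the durations as printed and compares the runs step by step. The helper name `parse_duration` and the dictionaries are illustrative only and are not part of Hail; the duration format (`15.752s`, `10.496ms`, `1m16.4s`) is assumed from what appears in the log.

```python
import re

def parse_duration(text):
    """Convert a duration such as '15.752s', '10.496ms' or '1m16.4s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)(ms|s)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, value, unit = match.groups()
    seconds = float(value) / 1000.0 if unit == "ms" else float(value)
    return (int(minutes) if minutes else 0) * 60 + seconds

# Timings copied from the two runs in the log above.
first_run = {"importvcf": "15.752s", "splitmulti": "10.496ms",
             "variantqc": "39.452ms", "count": "1m16.4s"}
second_run = {"importvcf": "15.956s", "splitmulti": "10.252ms",
              "variantqc": "39.146ms", "count": "2m3.4s"}

for step in first_run:
    before = parse_duration(first_run[step])
    after = parse_duration(second_run[step])
    print(f"{step:<10} {before:9.3f}s -> {after:9.3f}s  "
          f"delta {after - before:+8.3f}s  ratio {after / before:.2f}x")
```

Read this way, count takes 76.4 s in the first run and 123.4 s in the second, about 47 s (roughly 1.6x) slower, while importvcf differs by only about 0.2 s. The millisecond timings for splitmulti and variantqc suggest that those steps only set up lazy work that is actually executed during count, though the chat does not say so.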
cseed
@cseed
Aug 25 2016 17:14
Hmm. That’s surprising.
Tim Poterba
@tpoterba
Aug 25 2016 17:14
yeah.
cseed
@cseed
Aug 25 2016 17:14
These are consistent on dataflow (no waiting, other jobs, etc.)?
Tim Poterba
@tpoterba
Aug 25 2016 17:14
I’m going to investigate a bit more when I get a chance
yes
cseed
@cseed
Aug 25 2016 17:15
Only rebased the one commit?
Wait, and the first one does a shuffle and the second one doesn’t? Hmm.
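Note: the two log messages ("Ordering unsorted dataset with network shuffle" versus "Coerced sorted dataset") point to a choice between re-sorting the data and accepting it as already ordered. The chat does not show how Hail makes that choice, so the Python sketch below is only an assumed model: partitions are plain lists of keys, and the dataset counts as already sorted when every partition is internally sorted and no partition starts before the previous one ends. The function name `can_coerce_sorted` is hypothetical.

```python
def can_coerce_sorted(partitions):
    """Return True if the partitions, read in order, already form a sorted sequence.

    `partitions` is a list of lists of comparable keys. This is an assumed model
    of the check, not Hail's actual implementation.
    """
    previous_last = None
    for keys in partitions:
        # Each partition must be sorted internally.
        if any(a > b for a, b in zip(keys, keys[1:])):
            return False
        if keys:
            # A partition must not start before the previous non-empty one ended.
            if previous_last is not None and keys[0] < previous_last:
                return False
            previous_last = keys[-1]
    return True

# Already ordered across partitions: no shuffle needed in this model.
print(can_coerce_sorted([[1, 2], [3, 4]]))
# Partition ranges overlap: a shuffle (or some other re-sort) would be needed.
print(can_coerce_sorted([[1, 3], [2, 4]]))
```

Under this model, skipping the shuffle should save work, which is why a slower count in the run without a shuffle is the surprising part.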
cseed
@cseed
Aug 25 2016 17:32
OK, I made another coalesce commit: moved coalesce to OrderedRDD, made it handle dependencies correctly (it wasn't doing that before), and made it compute preferred locations.
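Note: the commit itself is not shown in the chat, so the following is only a plain-Python model of what an order-preserving coalesce has to track, not Hail's or Spark's code. It assumes that coalescing merges adjacent parent partitions (so key order survives), that each output partition's dependencies are the parent indices it merges, and that preferred locations are ranked by how many of those parents prefer each host. All names (`coalesce_adjacent`, `preferred_locations`, the host names) are hypothetical. In Spark the corresponding hooks are `NarrowDependency.getParents` and `RDD.getPreferredLocations`.

```python
from collections import Counter

def coalesce_adjacent(num_parents, num_targets):
    """Split parent partition indices 0..num_parents-1 into contiguous groups.

    Group i lists the parents that output partition i depends on. Because the
    groups are contiguous and in order, concatenating them keeps the original
    partition order.
    """
    if num_parents < 0 or num_targets < 1:
        raise ValueError("num_parents must be >= 0 and num_targets >= 1")
    num_targets = min(num_targets, num_parents)  # never create empty groups
    groups = []
    for i in range(num_targets):
        start = i * num_parents // num_targets
        end = (i + 1) * num_parents // num_targets
        groups.append(list(range(start, end)))
    return groups

def preferred_locations(group, parent_hosts):
    """Rank hosts by how many parent partitions in the group prefer them."""
    counts = Counter(host for parent in group for host in parent_hosts[parent])
    return [host for host, _ in counts.most_common()]

# 16 parent partitions, as in the progress bars above, coalesced to 4.
groups = coalesce_adjacent(16, 4)
# Made-up host assignment: parents alternate between two hypothetical hosts,
# except that parent 0 also has a replica on host-b.
hosts = {p: ["host-a"] if p % 2 == 0 else ["host-b"] for p in range(16)}
hosts[0] = ["host-a", "host-b"]
for index, group in enumerate(groups):
    print(index, group, preferred_locations(group, hosts))
```

If the dependencies are wrong, an output partition may read parents it should not (or miss ones it should); if preferred locations are missing, the scheduler has no hint for placing the merged partition near its data.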