These are chat archives for broadinstitute/hail

24th
Aug 2016
Daniel King
@danking
Aug 24 2016 15:05
I've brought Jenkins offline and now only TeamCity will run hail builds
Laurent Francioli
@lfrancioli
Aug 24 2016 15:57
@tpoterba : If I see this hail: info: Coerced unsorted dataset, it means that the data wasn’t ordered, right?
Tim Poterba
@tpoterba
Aug 24 2016 15:58
that’s fine
interesting, wouldn’t expect that to happen often
there are 4 levels of sortedness
Laurent Francioli
@lfrancioli
Aug 24 2016 15:58
fine as in it means that the new ordering is working or not?
Tim Poterba
@tpoterba
Aug 24 2016 15:58
  1. completely sorted
Two. sorted by locus but not variant
Three. partitions are sorted according to min/max, but partitions are not internally sorted by locus
Four. partitions are unsorted
Laurent Francioli
@lfrancioli
Aug 24 2016 16:00
ok
Tim Poterba
@tpoterba
Aug 24 2016 16:00
1 involves no work
two involves a cheap mapPartitions
cseed
@cseed
Aug 24 2016 16:00
What file did you get this message on? That is really unlikely, so much so, it might be a bug.
Tim Poterba
@tpoterba
Aug 24 2016 16:00
three involves an expensive mapPartitions
Laurent Francioli
@lfrancioli
Aug 24 2016 16:00
a tiny test.vcf file
Tim Poterba
@tpoterba
Aug 24 2016 16:00
Four is a shuffle
oh
hah
1 partition will do that
cseed
@cseed
Aug 24 2016 16:01
Ah! It’s only one partition.
Tim Poterba
@tpoterba
Aug 24 2016 16:01
1 partition unsorted
cseed
@cseed
Aug 24 2016 16:01
So it will sort it but not do a shuffle.
That’s a cute optimization.
Laurent Francioli
@lfrancioli
Aug 24 2016 16:02
but it means that the data wasn’t seen as sorted, right?
cseed
@cseed
Aug 24 2016 16:02
Right. So it is probaby the X/Y/MT thing.
Is that possible?
Laurent Francioli
@lfrancioli
Aug 24 2016 16:02
yes, OK working on it
I should have a PR soon, but have to go for a while
Laurent Francioli
@lfrancioli
Aug 24 2016 17:29
So, I have a suggestion for contig ordering
but there is a deeper bug I think
It seems like no dataset is ever ordered
I ran this with the pubic hail command:
hail importvcf file:////home/unix/lfran/tools/hail/src/test/resources/sample.vcf.bgz count hail: info: running: importvcf file:////home/unix/lfran/tools/hail/src/test/resources/sample.vcf.bgz hail: info: Coerced unsorted dataset hail: info: running: count hail: info: count: nSamples 100 nVariants 346 hail: info: while importing: file:/home/unix/lfran/tools/hail/src/test/resources/sample.vcf.bgz import clean hail: info: timing: importvcf: 7.487s count: 760.901ms
anyways, this is the VCF in the resources and it has a single chromosome with all positions sorted
I have to go again, but I can debug more later
later
if someone else is going to look at this, here is my suggestion for the ordering:
object Contig {
  def compare(lhs: String, rhs: String): Int = {
    contigToInt(lhs).compare(contigToInt(rhs))
  }

  def contigToInt(contig: String) : Int = {
    if(contig.forall(_.isDigit)) contig.toInt
    else if(contig == "X") 23
    else if(contig == "Y") 24
    else if(contig == "MT") 25
    else contig.hashCode + 26
  }
}
Tim Poterba
@tpoterba
Aug 24 2016 17:34
I don’t think this works — you get random ordering for string contigs
“A”.hashCode is not necessarily smaller than “B”.hashCode
Laurent Francioli
@lfrancioli
Aug 24 2016 17:35
random but consistent
right, do we care?
Tim Poterba
@tpoterba
Aug 24 2016 17:35
I think we want “A” < “B"
Laurent Francioli
@lfrancioli
Aug 24 2016 17:35
if we do, I’ll do something else
OK, we should define what we want :)
I don’t think the order of String contigs that we have no clue about matters
Tim Poterba
@tpoterba
Aug 24 2016 17:36
I think we want numeric comparisons first, then X < Y < MT, then all other strings
Laurent Francioli
@lfrancioli
Aug 24 2016 17:36
but I may be wrong
Tim Poterba
@tpoterba
Aug 24 2016 17:36
hash code can also be negative I think
we want to be consistent though
Laurent Francioli
@lfrancioli
Aug 24 2016 17:36
ah, good point
OK, I’ll suggest something else then
but really seems like there is a deeper issue
OK, gotta go
Daniel King
@danking
Aug 24 2016 17:52
And now you should see links to the docs builds under the GitHub checks. You may need to click "Show all checks" to see the message about the docs. The "details" link will bring you directly to the version of the docs generated by the current PR.
cseed
@cseed
Aug 24 2016 20:05
@lfrancioli There?
Laurent Francioli
@lfrancioli
Aug 24 2016 20:06
Not really... Not in front of my computer
But I did implement something else for the contig. Haven't found the more general problem though (did not get much time)
cseed
@cseed
Aug 24 2016 20:08
Tim and I talked about it and I think we have a plan. I might go ahead and do something for contig now if you don’t mind.
Laurent Francioli
@lfrancioli
Aug 24 2016 20:09
Of course!
Daniel King
@danking
Aug 24 2016 21:05
This message was deleted
Daniel King
@danking
Aug 24 2016 21:25

Regarding writing tests using the random generators, how do people feel about these three options:

Option 1:

for (j <- forAll(Gen.choose(0, 10000));
  k <- forAll(Gen.choose(0, 10000));
) yield {
  val gt = if (j < k) GTPair(j, k) else GTPair(k, j)
  Genotype.gtPair(Genotype.gtIndex(gt)) == gt
}

Option 2:

forAll(Gen.choose(0, 10000)) { (j: Int) =>
forAll(Gen.choose(0, 10000)) { (k: Int) =>
  val gt = if (j < k) GTPair(j, k) else GTPair(k, j)
  Genotype.gtPair(Genotype.gtIndex(gt)) == gt
} }

Option 3:

forAll(Gen.choose(0, 10000), Gen.choose(0, 10000)) { (j, k) =>
  val gt = if (j < k) GTPair(j, k) else GTPair(k, j)
  Genotype.gtPair(Genotype.gtIndex(gt)) == gt
}

Note that the last option limits the nesting to whatever we provide in the
library whereas options one and two provide an arbitrary number of universally
quantified variables.

cseed
@cseed
Aug 24 2016 21:31
@tpoterba I merged in the fix #1 variant sort order.
And I’m working on repartition.
cseed
@cseed
Aug 24 2016 22:03
@tpoterba I’m done implementing repartition (need to test) but I realize we’ve implemented coalesce, not repartition. It only makes sense if the number of partitions is shrinking. But I think that’s all we ever use it for. So I suggest renaming it to coalesce and printing a polite warning and continuing if the requested number of partitions is greater than requested.
Tim Poterba
@tpoterba
Aug 24 2016 23:32
what do you do if n > current nPartitions?
nothing?
cseed
@cseed
Aug 24 2016 23:32
Crash, I suppose.
Actually, probably create a bunch of empty partitions.
Tim Poterba
@tpoterba
Aug 24 2016 23:32
haha
that’ll show em