These are chat archives for broadinstitute/hail

22nd
Aug 2016
Daniel King
@danking
Aug 22 2016 13:53
Continuous integration and continuous deployment seem like socialism and communism. Everyone wants the latter and believes it is perfection and bliss but in reality we end up with the former and everyone gripes about it.
My thoughts while looking for Jenkins alternatives 😬
Tim Poterba
@tpoterba
Aug 22 2016 14:06
@cseed: let's discuss the OrderedRDD stuff?
also Dan
cseed
@cseed
Aug 22 2016 14:14
Let’s discuss. I’m planning on workng on it with top priority today.
Tim Poterba
@tpoterba
Aug 22 2016 14:16
alright great
I addressed most of your comments and pushed
I've also taken a stab at a new join algorithm that skips around the right RDD
cseed
@cseed
Aug 22 2016 14:17
Great! I’ll take a look.
Tim Poterba
@tpoterba
Aug 22 2016 14:17
I'll push that up as maybe a PR into my own branch
cseed
@cseed
Aug 22 2016 14:17
OK, super.
Tim Poterba
@tpoterba
Aug 22 2016 14:17
it does a LOT more work per element though, so I think the best solution will be to check how many partitions on the right are loaded, and choose based on a cutoff
cseed
@cseed
Aug 22 2016 14:17
I’m going to break my changes into a few small pieces, too, and will push them up as PRs one at at time.
Tim Poterba
@tpoterba
Aug 22 2016 14:18
less than 5 partitions (say) will use the current join, > 5 will use the skipping one
cseed
@cseed
Aug 22 2016 14:18
It does? Hmm. I was imaginging that it should do at most one extra key comparison per element.
Tim Poterba
@tpoterba
Aug 22 2016 14:18
you can take a look
cseed
@cseed
Aug 22 2016 14:18
Sounds good.
Tim Poterba
@tpoterba
Aug 22 2016 14:19
it was pretty challenging to make it work, and it certainly doesn't look nice
Tim Poterba
@tpoterba
Aug 22 2016 15:43
@cseed: this is the keySortIterator implementation we had at one point:

 class KeySortIterator[T, K, V](it: Iterator[(K, V)])(implicit tOrd: Ordering[T], kOrd: Ordering[K],        
   ev: (K) => T) extends Iterator[(K, V)] {        

   val b = mutable.ArrayBuffer[(K, V)]()        
   b.clear()        
   var takingSortedIterator: Iterator[(K, V)] = Iterator()        

   override def hasNext: Boolean = takingSortedIterator.nonEmpty || b.nonEmpty || it.nonEmpty        

   def next: (K, V) = {        
     if (takingSortedIterator.hasNext)        
       takingSortedIterator.next()        
     else if (it.isEmpty) {        
       assert(b.nonEmpty)        
       if (b.size == 1) {        
         val toReturn = b.head        
         b.clear()        
         toReturn        
       }        
       else {        
         takingSortedIterator = b.sortBy(_._1).iterator        
         b.clear()        
         next()        
       }        
     } else if (b.isEmpty) {        
       b += it.next()        
       next()        
     } else {        
       val (k, v) = it.next()        
       val t = ev(k)        
       val lastT = ev(b.last._1)        
       assert(tOrd.gteq(t, lastT), "iterator not K-sorted")        
       if (tOrd.equiv(t, lastT)) {        
         b += ((k, v))        
         next()        
       } else {        
         if (b.size == 1) {        
           // common case        
           val last = b(0)        
           b(0) = (k, v)        
           last        
         } else {        
           takingSortedIterator = b.sortBy(_._1).iterator        
           b.clear()        
           b += ((k, v))        
           next()        
         }        
       }        
     }        
   }        
 }
Daniel King
@danking
Aug 22 2016 16:31
@cseed & others, so I got TeamCity set up locally. It was frustrating to figure out their terminology and get things working, but it now seems to work properly (i.e. it actually cleans up workspaces when its done and has settings for deleting artifacts based on age and whether it was a success or not)
The free license is sufficient for our purposes because we don't need more than three build agents and more than 20 configurations https://www.jetbrains.com/teamcity/buy/#license-type=new-license
The UI is nicer than Jenkins, but otherwise it seems mostly the same. I'm miffed that working around Jenkins not cleaning up after builds properly means using a whole new product (and one that requires a separate database that we have to stand-up). Our alternative is to have some cron job that cleans up after jenkins periodically at some time when no one is using it. I could go either way. Let me know how we should proceed.
cseed
@cseed
Aug 22 2016 17:23
@tpoterba OK, I made my first PR against your partitioning branch. I killed the Transformed stuff and a few more minor changes.
Tim Poterba
@tpoterba
Aug 22 2016 17:23
great! I'll take a look
cseed
@cseed
Aug 22 2016 17:23
I also made (K) => T explicit, which I’m not 100% sure is an improvement.
Tim Poterba
@tpoterba
Aug 22 2016 17:23
I thought about that, too
cseed
@cseed
Aug 22 2016 17:46

We could have:

trait OrderedKey[K: Ordering] {
  type PartitioningKey: Ordering
  def project(k: K) => PartitionKey
}

And implement OrderedKey[Variant] and OrderedKey[Locus] implicitly.

Tim Poterba
@tpoterba
Aug 22 2016 18:13
I like that
cseed
@cseed
Aug 22 2016 18:13
Me too. I’ll add that.
Tim Poterba
@tpoterba
Aug 22 2016 18:14
alright. Also, do you strongly feel that projectKey should just go in the partitioner and not the orderedRDD?
cseed
@cseed
Aug 22 2016 18:15
It just seemed unnecessarily duplicated since OrderedRDD always has a Partitioner and with the explicit change, we have to pass it around everywhere.
Tim Poterba
@tpoterba
Aug 22 2016 18:15
got it
oh, you can access it
cseed
@cseed
Aug 22 2016 18:15
But no, I don’t feel stringly.
Yeah, it is a case class.
It should start picking up anyone's commits to PRs or commits to master
There's a test PR I have that is building now. This is what it looks like in TeamCity: http://ci.hail.is:8111/viewLog.html?buildId=1&tab=buildResultsDiv&buildTypeId=HailSourceCode_HailCi
Tim Poterba
@tpoterba
Aug 22 2016 19:03
credentials are in hail-internal?
Daniel King
@danking
Aug 22 2016 19:03
You should be able to log in as guest
I will place the credentials in hail-internal shortly
for the root user
Tim Poterba
@tpoterba
Aug 22 2016 19:03
ah cool
Daniel King
@danking
Aug 22 2016 19:04
The build log is a bit spammy right now (because of initial maven dependency downloads) but it has a nice "folding" feature http://ci.hail.is:8111/viewLog.html?buildId=1&buildTypeId=HailSourceCode_HailCi&tab=buildLog
And on the build page if you click on the failed test it shows you the relevant console output
Daniel King
@danking
Aug 22 2016 19:09
I suppose we can leave them both running for a couple days to see if TeamCity is copacetic. If it's all good we can shutdown Jenkins
Daniel King
@danking
Aug 22 2016 19:14
Also worth noting that some of our disk space problems come from the build directory being ~400MB
The tar file is 200MB and the zip file is 180MB
ah but maybe jenkins doesn't use gradle build
cseed
@cseed
Aug 22 2016 19:28
@tpoterba rewrote localKeySort, made another PR. I’m going to go over the OrderedRDD constructor next (and add OrderedKey trait)
This exposed a bug in the constructor, so I’m just forcing to shuffle each time right now.
Tim Poterba
@tpoterba
Aug 22 2016 19:32
alright.
what was the bug?
cseed
@cseed
Aug 22 2016 19:33
testVariantTSVAnnotation was failing becuase the last annotation line was disappearing, and it was annotating None instead whatever it was supposed to be.
Tim Poterba
@tpoterba
Aug 22 2016 19:33
that is bad.
cseed
@cseed
Aug 22 2016 19:34
If you comment back in the shuffle optimization constructor you should be able to reproduce it.
Tim Poterba
@tpoterba
Aug 22 2016 19:34
are we sure it's a bug in the constructor and not the local key sort?
I guess we're testing localKeySort elsewhere
so that's ok
cseed
@cseed
Aug 22 2016 19:36
Yes, I tested it carefully, it is not the problem here.
Tim Poterba
@tpoterba
Aug 22 2016 19:36
okay great
does it.buffered give you access to the first element without nexting it?
Daniel King
@danking
Aug 22 2016 19:42
Indeed, the docs claim: "Buffered iterators are iterators which provide a method head that inspects the next element without discarding it."
Tim Poterba
@tpoterba
Aug 22 2016 19:42
yep, saw that too when I looked at the code in intelliJ
that’s handy
cseed
@cseed
Aug 22 2016 19:46
Yep! .head.
cseed
@cseed
Aug 22 2016 19:57

@tpoterba Want to make sure I understand the goals of the OrderedRDD constructor:

  1. fast case if it is OrderedRDD or has OrderedPartition already,
  2. scan through once to check if it is sorted to avoid shuffle,
  3. if there is a cheap way to get the keys (e.g. parsing VCF), use those in (2) and to compute the range partitioning.

Yes?

Tim Poterba
@tpoterba
Aug 22 2016 19:57
yep
cseed
@cseed
Aug 22 2016 20:02
And then you handle T-partitioned but not K-sorted, and T-partitioned and T-sorted but not K-sorted?
Tim Poterba
@tpoterba
Aug 22 2016 20:02
yes
that’s what fromPartitionSummaries does
cseed
@cseed
Aug 22 2016 20:03
OK, great.
Tim Poterba
@tpoterba
Aug 22 2016 20:10
I don’t see a way for the problem not to be with localKeySort
          val sortedRDD = rdd.mapPartitionsWithIndex { case (i, it) =>
            val is = it.toIndexedSeq
              val r = (if (arr(i)._2) is.iterator
              else is.iterator.localKeySort[T](projectKey)).toIndexedSeq
              println(r.toSet == is.toSet)
              r.iterator
}
that’s printing false
cseed
@cseed
Aug 22 2016 20:11
The bug is in the input to localKeySort. It’s not getting all the elements on the right.
Tim Poterba
@tpoterba
Aug 22 2016 20:11
what do you mean in the input?
cseed
@cseed
Aug 22 2016 20:12
Oops, I tracked the bug down to sortedLeftJoinDistinct (oops, I left all the prints in there, sorrY).
And it wasn’t getting everything on the right.
Tim Poterba
@tpoterba
Aug 22 2016 20:12
yeah
it looks like localKeySort is truncating the last element
the code I pasted is in the OrderedRDD constructor
cseed
@cseed
Aug 22 2016 20:13
Ah, sorry, when I said localKeySort I meant orderedLeftJoinDistinct.
Tim Poterba
@tpoterba
Aug 22 2016 20:14
oh
you did have a bug
cseed
@cseed
Aug 22 2016 20:14
Yes, I did have a bug. Making it always shuffle fixed it (for the moment). I thought it was in the constructor (which calls localKeySort).
Tim Poterba
@tpoterba
Aug 22 2016 20:15
nope it was here: def hasNext: Boolean = it.hasNext || q.nonEmpty
cseed
@cseed
Aug 22 2016 20:15
Oops, hasNext calls it instead of bit
Tim Poterba
@tpoterba
Aug 22 2016 20:15
yeah
cseed
@cseed
Aug 22 2016 20:15
Yep! Good catch, thanks!
Tim Poterba
@tpoterba
Aug 22 2016 20:21
I wondered why the test didn’t catch it:
! Falsified after 10 passed tests.
heh
cseed
@cseed
Aug 22 2016 20:22
That’s bad. I do wonder why. It should basically fail every time.
Tim Poterba
@tpoterba
Aug 22 2016 20:23
it would only fail if the queue was empty when it gets to the last element
cseed
@cseed
Aug 22 2016 20:54
So that’s about a 1 in 50 chance. Hrm. Count could probably be higher. Daniel and I have been talking about way to improve the generators to catch some of the bugs we’ve missed.
Tim Poterba
@tpoterba
Aug 22 2016 20:55
yeah, the other solution for that was to increase the sparsity
cseed
@cseed
Aug 22 2016 20:55
I rewrote apply. Testing now. Code is looking really good.
In apply, we should be able to ask if any partition is overlapping, then renumber the partitions so they’re in order.
Tim Poterba
@tpoterba
Aug 22 2016 20:56
I agree
cseed
@cseed
Aug 22 2016 20:57
then if we can figure out how not to split across chromosome boundaries, we’re goldne.
Tim Poterba
@tpoterba
Aug 22 2016 20:57
yep
Tim Poterba
@tpoterba
Aug 22 2016 21:05
@cseed: the partition skipping stuff is here
cseed
@cseed
Aug 22 2016 21:14
Quick Q, when a partition is empty, you bail and shuffle, yes?
Tim Poterba
@tpoterba
Aug 22 2016 21:14
yeah
that’s sorta the same problem as the rearranging partitions
cseed
@cseed
Aug 22 2016 21:15
Yeah. Super annoying :)
cseed
@cseed
Aug 22 2016 21:43
OK, new PR for OrderedRDD constructor cleanup.
Hmm, conflicts, let me look at that.
cseed
@cseed
Aug 22 2016 21:53
@tpoterba OK, fixed. To you.
cseed
@cseed
Aug 22 2016 21:58
I will look at your PR next.
Daniel King
@danking
Aug 22 2016 22:16
Oh this is horrible and awesome:
∀ { (j: Int) => ∀ { (k: Int) => j < 2 || k < 2 || j * k > j + k } }.check()
Which isn't actually true due to overflow :(
(the above is honest to goodness compiling Scala)
cseed
@cseed
Aug 22 2016 22:29
@tpoterba I’m running home but will probably try to finish this all tonight. Let me know how you want to proceed with the PRs. I looked at your skip code and have some proposed cleanups.
Tim Poterba
@tpoterba
Aug 22 2016 23:06
I merged your branch now, still want to address the comments though. If you have changes for the skip stuff, maybe checkout the branch and make a new PR against partitioning_2?
cseed
@cseed
Aug 22 2016 23:12
Haha, OK! Branches all the way down.
Yes, I have some things that I think help clean it up.