These are chat archives for broadinstitute/hail

19th
Aug 2016
cseed
@cseed
Aug 19 2016 14:14
Goes without saying, but adding BKRs should be its own self-contained PR.
Tim Poterba
@tpoterba
Aug 19 2016 15:19
@cseed: can we discuss the lazy eval vs serialize stuff? broadinstitute/hail#629
cseed
@cseed
Aug 19 2016 15:32
Yes. I still think this doesn’t work. In my example, every sample will have a reference to the same combiner. So aggregation results for all samples will be the same, and they will have aggregated every item for every sample.
Does that make sense? I started writing up an example to illustrate the problem.
Tim Poterba
@tpoterba
Aug 19 2016 15:33
yeah it totally makes sense
but it’s also consistent with the way other collections work
cseed
@cseed
Aug 19 2016 15:33
The point of ser/deser is to copy objects. Java doesn’t have a built-in way to do that.
Huh? Other collections don’t implicitly run your reduction in parallel.
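(For context, a minimal sketch of the ser/deser copy being described; deepCopy is a hypothetical helper, not Hail’s actual code.)

import java.io._

def deepCopy[T <: Serializable](t: T): T = {
  // round-trip through Java serialization to get an independent copy,
  // so each sample can start from its own zero-value combiner
  val baos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(baos)
  oos.writeObject(t)
  oos.close()
  val ois = new ObjectInputStream(new ByteArrayInputStream(baos.toByteArray))
  ois.readObject().asInstanceOf[T]
}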
Tim Poterba
@tpoterba
Aug 19 2016 15:33
scala> val arr = Array.fill(10)(Array.tabulate(3)(i => i))
arr: Array[Array[Int]] = Array(Array(0, 1, 2), Array(0, 1, 2), Array(0, 1, 2), Array(0, 1, 2), Array(0, 1, 2), Array(0, 1, 2), Array(0, 1, 2), Array(0, 1, 2), Array(0, 1, 2), Array(0, 1, 2))

scala> import scala.collection.mutable._
import scala.collection.mutable._

scala> val ab = new ArrayBuffer[Int]
ab: scala.collection.mutable.ArrayBuffer[Int] = ArrayBuffer()

scala> arr.map(_.foldLeft(ab) { case (acc, i) => acc += i})
res0: Array[scala.collection.mutable.ArrayBuffer[Int]] = Array(ArrayBuffer(0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2), ArrayBuffer(0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2), ArrayBuffer(0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2), ArrayBuffer(0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2), ArrayBuffer(0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2), ArrayBuffer(0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2), ArrayBuffer(0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2), ...

scala> arr.map(_.foldLeft(new ArrayBuffer[Int]) { case (acc, i) => acc += i})
res2: Array[scala.collection.mutable.ArrayBuffer[Int]] = Array(ArrayBuffer(0, 1, 2), ArrayBuffer(0, 1, 2), ArrayBuffer(0, 1, 2), ArrayBuffer(0, 1, 2), ArrayBuffer(0, 1, 2), ArrayBuffer(0, 1, 2), ArrayBuffer(0, 1, 2), ArrayBuffer(0, 1, 2), ArrayBuffer(0, 1, 2), ArrayBuffer(0, 1, 2))
cseed
@cseed
Aug 19 2016 15:35
I don’t see the analog here. There is no situation where you would want sharing between samples.
Tim Poterba
@tpoterba
Aug 19 2016 15:36
I agree
but I also think that passing a zero value as new Combiner is different from passing an instantiated one
alright, we’ll leave it for now. I have another couple questions about the BKR though
cseed
@cseed
Aug 19 2016 15:41
fold and friends in the Scala collections don’t take lazy zero values.
So I would say the current behavior mirrors the collections.
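(A hypothetical sketch of the distinction, not the proposed API: scala’s foldLeft takes its zero by value, so an already-instantiated mutable combiner is shared across calls, as in the ArrayBuffer example above; a by-name zero would be re-evaluated per call.)

def foldLeftLazyZero[A, B](xs: Iterable[A])(z: => B)(op: (B, A) => B): B =
  xs.foldLeft(z)(op)   // z is a by-name parameter, so each invocation builds a fresh zero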
Tim Poterba
@tpoterba
Aug 19 2016 15:42
we’ve got this:
trait BiKeyedRDDOps[R, C, E, T] {

  def data: BiKeyedRDDData[R, C, E]

  def newSelf(rdd: RDD[(R, (Annotation, Iterable[E]))] = data.rdd,
    columnKeys: IndexedSeq[C] = data.columnKeys,
    columnAnnotations: IndexedSeq[Annotation] = data.columnAnnotations,
    globalAnnotation: Annotation = data.globalAnnotation,
    rowAnnotationSchema: Type = data.rowAnnotationSchema,
    columnAnnotationSchema: Type = data.columnAnnotationSchema,
    globalAnnotationSchema: Type = data.globalAnnotationSchema): T
}
T is the type that is returned by newSelf
Daniel King
@danking
Aug 19 2016 15:42
I'm getting a bit turned around here. I want to build some tests for filterAlleles but I seem to be forced to write a lot of code to work around Array's having reference equality rather than structural. How do we normally work around having an Array hanging out in some arbitrary part of an Annotation?
cseed
@cseed
Aug 19 2016 15:42
I’m trying to see your point, but it seems to me like you’re confusing aggregateBySample as an abstraction barrier with its implementation as a nested map/fold over Scala collections.
Tim Poterba
@tpoterba
Aug 19 2016 15:43
yeah, maybe
cseed
@cseed
Aug 19 2016 15:43
Arrays in annotations should always be cast to IndexedSeq.
Tim Poterba
@tpoterba
Aug 19 2016 15:43
annotations obey structural equality
cseed
@cseed
Aug 19 2016 15:43
My other objection is that the change is very error-prone. Subtle unwanted sharing is very hard to track down. (That’s how the ser/deser stuff got put in in the first place.)
Tim Poterba
@tpoterba
Aug 19 2016 15:44
yeah, that’s true
I’m convinced
alright BKR
cseed
@cseed
Aug 19 2016 15:44
OK, I see your code. BKR.
Tim Poterba
@tpoterba
Aug 19 2016 15:44
I want to add this function:
  def mapValues[E2](f: (E) => E2)(implicit ect: ClassTag[E2]): BiKeyedRDD[R, C, E2] = {
    mapValuesWithAll((r, ra, c, ca, e) => f(e))
  }
but it should be in BKROps, not BKRDD
cseed
@cseed
Aug 19 2016 15:44
Yes.
Daniel King
@danking
Aug 19 2016 15:44
How does casting the type change the implementation of equals that gets dispatched to?
Tim Poterba
@tpoterba
Aug 19 2016 15:44
so what’s the type signature?
it doesn’t @danking
everything that goes in annotations checks structural equality
cseed
@cseed
Aug 19 2016 15:45
(Guessing the answer you want: make T a higher-kinded type parameterized by R, C, E.)
Tim Poterba
@tpoterba
Aug 19 2016 15:45
ah
cseed
@cseed
Aug 19 2016 15:45
it does @danking
Daniel King
@danking
Aug 19 2016 15:45
O.o
Tim Poterba
@tpoterba
Aug 19 2016 15:45
oh, the array to indexed seq thing?
cseed
@cseed
Aug 19 2016 15:46
casting an array to IndexedSeq uses implicit conversion to convert it to WrappedArray
Tim Poterba
@tpoterba
Aug 19 2016 15:46
it’s not cast, it’s implicitly converted
cseed
@cseed
Aug 19 2016 15:46
scala> val is: IndexedSeq[Int] = Array(1, 2, 3)
is: IndexedSeq[Int] = WrappedArray(1, 2, 3)

scala> is.getClass
res0: Class[_ <: IndexedSeq[Int]] = class scala.collection.mutable.WrappedArray$ofInt
Sorry, yes, converted (implicitly or explicitly), not cast. My bad.
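(Illustratively, why the conversion matters for equality:)

val a = Array(1, 2, 3)
val b = Array(1, 2, 3)
a == b                                        // false: Array equality is reference equality
(a: IndexedSeq[Int]) == (b: IndexedSeq[Int])  // true: WrappedArray compares elementwise, like any Seq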
Daniel King
@danking
Aug 19 2016 15:47
oy vey
Thanks, that should fix it
cseed
@cseed
Aug 19 2016 15:47
The behavior of Arrays leaves much to be desired.
Daniel King
@danking
Aug 19 2016 15:47
Shouldn't we just never use the Array constructor then anywhere in our code?
Tim Poterba
@tpoterba
Aug 19 2016 15:47
but the performance of indexed seqs leaves much to be desired
cseed
@cseed
Aug 19 2016 15:47
They are only half first-class citizens in the Scala collections, for performance reasons. (For the same reason arrays aren’t subject to erasure in Java, for example.)
No, you want an array for performance, but then wrap it for reasonable behavior.
The IndexedSeq constructor creates a Vector, which is terrible (fully boxed, bad overhead).
Daniel King
@danking
Aug 19 2016 15:48
Wouldn't always using the WrappedArray constructor achieve this?
cseed
@cseed
Aug 19 2016 15:49
Which constructor? It has no useful ones. WrappedArray isn’t meant to be used explicitly. It just implements the Array -> IndexedSeq conversion.
Daniel King
@danking
Aug 19 2016 15:50
Ah.
cseed
@cseed
Aug 19 2016 15:51
This is as good as it gets:
scala> import scala.collection.mutable
import scala.collection.mutable

scala> mutable.WrappedArray.make(Array(1,2, 3))
res4: scala.collection.mutable.WrappedArray[Nothing] = WrappedArray(1, 2, 3)
Daniel King
@danking
Aug 19 2016 15:51
The printed form sort of implies a constructor that doesn't exist.
cseed
@cseed
Aug 19 2016 15:51
Very true.
Lies, damn lies, and .toString output.
Daniel King
@danking
Aug 19 2016 15:52
So we should only ever place WrappedArray into an Annotation position
never Array
Tim Poterba
@tpoterba
Aug 19 2016 15:52
right
cseed
@cseed
Aug 19 2016 15:53
Correct. See, for example, TArray.typeCheck.
Daniel King
@danking
Aug 19 2016 15:57
I feel like I want Annotation to be an existential type hidden behind a module barrier, behind which it's actually Any, but the module only provides the sensible constructors (i.e. a varargs constructor that actually produces a WrappedArray)
This should be achievable with Scala classes, no?
cseed
@cseed
Aug 19 2016 16:01
Yes, I think so.
It might be the same as wrapping it in a value class. class Annotation(val value: Any) extends AnyVal { … }. Any is effectively the same as T forSome { type T }.
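(A hypothetical sketch of that module idea, not Hail’s actual Annotation:)

// hide Any behind a value class; only the companion's constructors are "blessed",
// so a raw Array can never leak in unwrapped
class Annotation(val value: Any) extends AnyVal

object Annotation {
  // varargs arrive as a Seq (a WrappedArray in Scala 2.x), never a raw Array
  def arrayOf(elems: Any*): Annotation = new Annotation(elems)
}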
Tim Poterba
@tpoterba
Aug 19 2016 16:20
any way to pass implicit values into a trait?
cseed
@cseed
Aug 19 2016 16:26
I don’t think so. You might need to do them per-call.
Tim Poterba
@tpoterba
Aug 19 2016 16:26
ugly
Daniel King
@danking
Aug 19 2016 16:36

On this line: https://github.com/broadinstitute/hail/blob/master/src/main/scala/org/broadinstitute/hail/variant/Genotype.scala#L77

We should replace pl with px, right?

I don't know what pl is so perhaps I'm not constructing genotypes correctly...

cseed
@cseed
Aug 19 2016 16:45
@tpoterba Actually, you can use abstract methods to pass things “up” into traits (like data), so: implicit def tct: ClassTag[T] should work! Then the implementer can implement it with implicit val tct: ClassTag[T]. Voila!
It seems like you might want a bound on T, like T <: BKROps[R, C, E, T] or some such.
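(A hedged sketch of that pattern with illustrative names, not Hail’s actual code:)

import scala.reflect.ClassTag

// abstract implicit pushed "up" into the trait, plus the F-bound on T
trait BKROps[R, C, E, T <: BKROps[R, C, E, T]] {
  implicit def ect: ClassTag[E]
  def self: T
}

// a concrete implementer satisfies the abstract implicit with an implicit val
class BKR[R, C, E](implicit val ect: ClassTag[E]) extends BKROps[R, C, E, BKR[R, C, E]] {
  def self: BKR[R, C, E] = this
}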
Yes, that’s definitely a bug!
cseed
@cseed
Aug 19 2016 17:26
@danking PL in the VCF FORMAT field is the phred-scaled (log base 10) genotype likelihoods, scaled so the minimum is 0. GL is the linear (unscaled) genotype probabilities (stored fixed point, sum to 32768). isDosage is the flag that indicates which is stored in px (which should really be called xl), dosage being GL.
Daniel King
@danking
Aug 19 2016 17:36
xl because x is either P or G?
I'll submit a PR
cseed
@cseed
Aug 19 2016 17:36
Exactly. Great!
Wait.
OK, it’s slightly more complicated.
PL is the likelihoods. If you have priors, you can apply Bayes’ rule and get posteriors. Phred-scaled posteriors are called PP.
GL is linear likelihoods. GP is linear posteriors. In fact, dosages are GP.
So we’re really storing PL vs GP.
So maybe px is fine?
This should all be fixed by generalized genotype schemas where we can use more accurate names.
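(Roughly the relationship between the linear and phred-scaled representations, as a generic sketch rather than Hail’s implementation:)

// phred-scale linear likelihoods (or posteriors) and normalize so the minimum is 0
def phredScale(linear: Array[Double]): Array[Int] = {
  val phred = linear.map(x => -10.0 * math.log10(x))
  val min = phred.min
  phred.map(p => math.round(p - min).toInt)
}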
Daniel King
@danking
Aug 19 2016 17:56
mm
Daniel King
@danking
Aug 19 2016 18:16
@cseed $.get returns a promise?
"As of jQuery 1.5, all of jQuery's Ajax methods return a superset of the XMLHTTPRequest object. This jQuery XHR object, or "jqXHR," returned by $.get() implements the Promise interface, giving it all the properties, methods, and behavior of a Promise"
Daniel King
@danking
Aug 19 2016 18:18
Nice
cseed
@cseed
Aug 19 2016 18:18
I’m starting to learn some JS. Amazing.
Daniel King
@danking
Aug 19 2016 18:20
jQuery really provides a nicer interface than XMLHttpRequest
@cseed is there a syntax for referencing one command from another in the markdown?
cseed
@cseed
Aug 19 2016 18:22
Yes, just link like this:
[`importvcf`](#importvcf)
spaces get replaced by underscores.
This should be added to docs/DocsStyleGuide.md.
(and slashes)
cseed
@cseed
Aug 19 2016 18:34
Any other comments on the minor changes and mathjax stuff? Otherwise, I’ll merge it in.
Daniel King
@danking
Aug 19 2016 18:55
@cseed no comments, I was gonna build with it and take a look but I imagine it's fine to merge
I did notice that command links don't have any hover or visited behavior because the styling on code takes precedence
cseed
@cseed
Aug 19 2016 18:56
No, good idea to try it out. I wonder if we can make Jenkins build the docs so we can try it out there before merging?
Daniel King
@danking
Aug 19 2016 18:56
not sure how we'd serve it
cseed
@cseed
Aug 19 2016 18:56
Good catch, let’s fix that.
Daniel King
@danking
Aug 19 2016 18:56
I guess you could dump it under some directory
cseed
@cseed
Aug 19 2016 18:57
We could throw it in /var/www somewhere and post a link in the Jenkins build page?
Yeah. I’ll make an issue for this.
Daniel King
@danking
Aug 19 2016 18:58
yeah I think that would work
cseed
@cseed
Aug 19 2016 18:58
Automate all the things!
Daniel King
@danking
Aug 19 2016 18:58
Because otherwise I have to get into a steady state with my existing stuff before I can switch branches to build and whatever. It's a pain
/giphy all the things
aww no giphy.
Tim Poterba
@tpoterba
Aug 19 2016 19:41
Any idea why VEP is failing, Cotton?
is it memory again?
cseed
@cseed
Aug 19 2016 19:43
I assume so. I haven’t looked over the logs. I just know they ran it a few times and it “failed”.
Daniel King
@danking
Aug 19 2016 20:27
How do we feel about per-function comments? I haven't noticed any such comments, but some of these filterAlleles functions perform complex enough tasks that I'd think a signpost would be helpful
Laurent Francioli
@lfrancioli
Aug 19 2016 20:42
@danking I would totally vote for it!
cseed
@cseed
Aug 19 2016 20:46
I’m not against it, we’ll just have to be careful about making sure the code and comments don’t get out of sync.
Tim Poterba
@tpoterba
Aug 19 2016 21:03
re the orderedRDD stuff, Cotton: the transformed thing has to stay
there are two different operations
cseed
@cseed
Aug 19 2016 21:03
OrderedLeftJoinRDD and OrderedLeftJoinDistinctRDD?
Tim Poterba
@tpoterba
Aug 19 2016 21:04
The left join takes two things that are partitioned the same (key K, partition key T), and joins by K. The transformed thing takes two things that are partitioned the same (key1 K1, key2 K2, partition key T) and joins by the partition key
the orderedLeftJoin and OrderedLeftJoinTransformed
cseed
@cseed
Aug 19 2016 21:05
Your capitalization confuses me. Can you fix?
Tim Poterba
@tpoterba
Aug 19 2016 21:06
in the code or here?
cseed
@cseed
Aug 19 2016 21:06
Here. You said “the transformed thing has to stay”. There are a bunch of transformed/untransformed pairs. Which one specifically are you talking about?
Tim Poterba
@tpoterba
Aug 19 2016 21:07
we can't remove one of the join RDDs
we need both
one does join by variant, one does join by locus
cseed
@cseed
Aug 19 2016 21:07
As for the operations on RichPairIterator: one literally calls the other.
Then why is the result of OrderedLeftJoinTransformedRDD an RDD[(K1, …)]?
And not T?
Tim Poterba
@tpoterba
Aug 19 2016 21:09
because it only joins by T, it doesn't keep it
this is for annotate loci
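(A hedged sketch of the two join shapes, with illustrative names and types rather than the actual Hail signatures:)

import org.apache.spark.rdd.RDD

trait OrderedJoins {
  // join by the full key K (e.g. variant); both sides ordered and partitioned by K
  def leftJoinByKey[K, V1, V2](
    left: RDD[(K, V1)], right: RDD[(K, V2)]): RDD[(K, (V1, Option[V2]))]

  // join by the partition key T only (e.g. locus); the keys differ (K1 vs K2) but
  // both project to T, and the result keeps the left key K1 and drops T
  def leftJoinByPartitionKey[K1, K2, T, V1, V2](
    left: RDD[(K1, V1)], right: RDD[(K2, V2)])(pk1: K1 => T, pk2: K2 => T): RDD[(K1, (V1, Option[V2]))]
}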
cseed
@cseed
Aug 19 2016 21:11
OK, got it, I see what the intention is. I will keep studying the code. Thanks.
Tim Poterba
@tpoterba
Aug 19 2016 21:12
sure
it's definitely a weird operation
but it's what we need
cseed
@cseed
Aug 19 2016 21:19
PRs are fast and furious again! I love it!
cseed
@cseed
Aug 19 2016 21:24
Seeing all those sorts and persists get deleted makes me so happy.
Tim Poterba
@tpoterba
Aug 19 2016 21:24
yep
Daniel King
@danking
Aug 19 2016 21:31
I'd prefer to do this in person, so perhaps Monday someone could help me understand how to get a big dataset and run it on some beefy machines? I'd like to get a sense for the performance of filterAlleles on a large data set
cseed
@cseed
Aug 19 2016 21:31
Yep! We can hook you up on Monday.
cseed
@cseed
Aug 19 2016 21:44
@tpoterba OK, I finished the first pass. Looks great. I’m not totally happy with: the logic around OrderedRDD.apply, the transformed stuff and annotate loci, and the local sort implementation. I’ll suggest some changes for those three (make a PR against your branch). Otherwise, feel free to address my other comments, which I think are pretty minor.
Tim Poterba
@tpoterba
Aug 19 2016 21:51
will do