These are chat archives for broadinstitute/hail

18th
Aug 2016
Tim Poterba
@tpoterba
Aug 18 2016 13:05
The latest PR prompted me to look at the exportvcf code, and I think we can speed it up by a lot
(in exportvcf)
gs.foreach { g =>
          sb += '\t'
          sb.append(g)
        }

(in Genotype)
  override def toString: String = {
    val b = new StringBuilder

    b.append(gt.map { gt =>
      val p = Genotype.gtPair(gt)
      s"${ p.j }/${ p.k }"
    }.getOrElse("./."))
    b += ':'
    b.append(ad.map(_.mkString(",")).getOrElse("."))
    b += ':'
    b.append(dp.map(_.toString).getOrElse("."))
    b += ':'
    b.append(gq.map(_.toString).getOrElse("."))
    b += ':'
    if (!isDosage)
      b.append(pl.map(_.mkString(",")).getOrElse("."))
    else
      b.append(dosage.map(_.mkString(",")).getOrElse("."))
    b.result()
  }
cseed
@cseed
Aug 18 2016 13:07
Yes. Lots of extra allocation.
Tim Poterba
@tpoterba
Aug 18 2016 13:07
this should be a function on Genotype with signature scala write(sb: StringBuilder): Unit, then toString should just be
def toString {
  val sb = new StringBuilder
  write(sb)
  sb.result()
}
cseed
@cseed
Aug 18 2016 13:08
Also, while is solution is clever, doing a regex/string edit per genotype is slightly horrifying to me.
Tim Poterba
@tpoterba
Aug 18 2016 13:08
yeah
cseed
@cseed
Aug 18 2016 13:08
In fact, I’d say toString shouldn’t be used for this and it should be in the ExportVCF code.
Tim Poterba
@tpoterba
Aug 18 2016 13:09
why not in genotype? that way if anyone else needs to write out genotypes, it’s there
cseed
@cseed
Aug 18 2016 13:09
Before we reply, let’s talk about it today.
Tim Poterba
@tpoterba
Aug 18 2016 13:09
sure
cseed
@cseed
Aug 18 2016 13:09
I want to make sure he has a good experience so he keeps contributing.
Tim Poterba
@tpoterba
Aug 18 2016 13:09
yeah for sure
do you know who it is?
cseed
@cseed
Aug 18 2016 13:10
Yeah. He emailed me about Hail after I posted a comment on Hacker News, and prompted me to add the “starter” issue tag.
Tim Poterba
@tpoterba
Aug 18 2016 13:10
oh, cool!
cseed
@cseed
Aug 18 2016 13:11
Yeah! I’m super psyched he contributed.
Tim Poterba
@tpoterba
Aug 18 2016 13:11
I noticed that. starter is like a little background so anyone can go address an issue?
cseed
@cseed
Aug 18 2016 13:11
Yep!
I’m going to put a little language on hail.is about getting started working on Hail and link to the starter issues.
Tim Poterba
@tpoterba
Aug 18 2016 13:11
cool
is the intent of starter to be something lower-priority that should be left to the wider community?
cseed
@cseed
Aug 18 2016 13:13
Not necessarily low-priority. We should feel free to do starter things if the time is right.
Tim Poterba
@tpoterba
Aug 18 2016 13:13
alright
cseed
@cseed
Aug 18 2016 13:13
But smaller, self-contained and well-defined, and not too hard.
Tim Poterba
@tpoterba
Aug 18 2016 13:13
got it
Tim Poterba
@tpoterba
Aug 18 2016 21:30
@cseed I’m struggling to write BiKeyedRDD in such a way that VSM operations easily override
metadata is the problem
cseed
@cseed
Aug 18 2016 21:30
Override? It should literally implement VSM operations once you do the type substitation.
Tim Poterba
@tpoterba
Aug 18 2016 21:31
right, but what about copy?
I need to get the VSM metadata in there somehow
this is very easy if I give the BiKeyedRDD a type parameter [M <: AbstractMetadata]
cseed
@cseed
Aug 18 2016 21:32
BiKeyedMatrix shouldn’t have a copy. VSM should.
Or maybe BKM should have a copy, but only handle global/row/column annotations. (Are globals in BKM or VSM?)
Tim Poterba
@tpoterba
Aug 18 2016 21:33
I put them in BKM for now, that can change
cseed
@cseed
Aug 18 2016 21:33
That was my inclination.
But VSM has to override BKM’s copy signature so it can copy it’s own stuff.
Tim Poterba
@tpoterba
Aug 18 2016 21:34
but that doesn’t typecheck
cseed
@cseed
Aug 18 2016 21:34
That way, say, BKM.cache can call copy but it creates a new VSM.
Tim Poterba
@tpoterba
Aug 18 2016 21:34
that’s the problem
cseed
@cseed
Aug 18 2016 21:34
It has to return a this.type instead of BKM.
Tim Poterba
@tpoterba
Aug 18 2016 21:34
it’s not the return type, it’s the arg type
you have to get the metadata in there
cseed
@cseed
Aug 18 2016 21:38
Sketch code:
class BKM[R, C, E](val rdd: RDD[R, Annotation, Iterable[E]], saType: Type, ...) {
  def copy(rdd = this.rdd, saType: Type): this.type  = { new BKM(rdd, saType, ...) }
}

class VSM[E] extends BKM[Variant, String, E](m: VariantMetadata, rdd, saType, ...) {
  def copy(rdd = this.rdd, saType: Type): this.type = { new VSM(m, rdd, saType, …)
}
Tim Poterba
@tpoterba
Aug 18 2016 21:39
so VSM needs a new function copyMetadata
cseed
@cseed
Aug 18 2016 21:39
Yes. A second copy function that includes the BKM fields as well as its own metadata fields.
I’d call it copy and overload.
Tim Poterba
@tpoterba
Aug 18 2016 21:40
ah
okay
cseed
@cseed
Aug 18 2016 21:45
Hmm, that might not work, the two calls will be ambigous.
Tim Poterba
@tpoterba
Aug 18 2016 21:46
this is a smaller problem, I think I have what I need
cseed
@cseed
Aug 18 2016 21:50
You can’t use this.type here.
Tim Poterba
@tpoterba
Aug 18 2016 21:50
why?
cseed
@cseed
Aug 18 2016 21:50
Also, inheritence of copy definitely doesn’t work.
scala> class A(i: Int) { def copy(i: Int = this.i): this.type = new A(i) }
<console>:8: error: type mismatch;
 found   : A
 required: A.this.type
       class A(i: Int) { def copy(i: Int = this.i): this.type = new A(i) }
                                                                ^
A subclass could choose not to inherit copy and that would be a type error.
Tim Poterba
@tpoterba
Aug 18 2016 21:51
ah
hmm
cseed
@cseed
Aug 18 2016 21:51
Yeah. This is a bit hard to do nicely.
You could have a simple BKM class (BKMValues?) that just has the BKM fields. Then you could have a BKMOps trait which has an a method to copy and another one to get the BKMValues, then VSM would implement BKMValues but store a BKMData. And BKM would store the values and implement the trait, too.
Tim Poterba
@tpoterba
Aug 18 2016 21:54
hmmmm
what’s BKMData?
cseed
@cseed
Aug 18 2016 21:55
And BKMOps woudl be BKMOps[T] where T is the return value of the copy method.
Tim Poterba
@tpoterba
Aug 18 2016 21:56
this is the generic BKM metadata?
cseed
@cseed
Aug 18 2016 21:56
BKMData is just a container for the rdd, saType, etc.
No, BKM has no metadata.
Let me sketch:
Tim Poterba
@tpoterba
Aug 18 2016 21:56
sure
cseed
@cseed
Aug 18 2016 22:03
class BKMData[R, C, E](rdd: RDD[(R, (Annotation, Iterable[E]))], saType: Type, …)

trait BKMOps[R, C, E, T] {
  def data: BKMData[R, C, E]
  // abstract with defaults
  def newSelf(rdd = data.rdd, saType = data.saType, …)

  // implemented
  def mapValues… = ...
  def aggregateByRows… = ...
}

// data implemented by val
class BKM[R, C, E](val data: BKMData[R, C, E]) implements BKMOps[R, C, E, BKM[R, C, E]] {
  def newSelf(rdd: RDD[…], saType: Type, …) = new BKM(data.copy(rdd, saType, …)
}

class VSM(data: BKMData[Variant, Sample, E], metadata: VariantMetadata) implements BKMOps[Variant, Sample, E, VSM[E]] {
  def newSelf…
  def copy(rdd = data.rdd, metadata = this.metadata, …) = new VSM(data.copy(…), metadata, …)
}
Tim Poterba
@tpoterba
Aug 18 2016 22:03
ohhhh
cseed
@cseed
Aug 18 2016 22:03
In particular, VSM.newSelf makes a new VSM with a copy of the BKMData but teh same metadata.
Tim Poterba
@tpoterba
Aug 18 2016 22:03
yeah
cseed
@cseed
Aug 18 2016 22:04
I think this might work!
Tim Poterba
@tpoterba
Aug 18 2016 22:04
it’s newSelf: T right?
cseed
@cseed
Aug 18 2016 22:04
Yes, newSelf: T
Tim Poterba
@tpoterba
Aug 18 2016 22:05
also, you’re using BKM. Do you prefer BiKeyedMatrix to BiKeyedRDD?
cseed
@cseed
Aug 18 2016 22:05
Ah, no, I prefer BiKeyedRDD.
BKR.
BKRDD?
Tim Poterba
@tpoterba
Aug 18 2016 22:05
alright
cseed
@cseed
Aug 18 2016 22:06
Bit convoluted just to subclass something.
Tim Poterba
@tpoterba
Aug 18 2016 22:06
BKRDDOps looks a bit much though
yeah.
cseed
@cseed
Aug 18 2016 22:06
The implementation of everything is in BKMOps.
Tim Poterba
@tpoterba
Aug 18 2016 22:08
yep
unrelated question: what is this
package org.broadinstitute.hail.methods

import org.broadinstitute.hail.variant._

object sSingletonVariants {
  def apply(vds: VariantDataset): Set[Variant] = {
    vds
      .mapValues(g => if (g.isCalledNonRef) 1 else 0)
      .foldByVariant(0)(_ + _)
      .filter(_._2 == 1)
      .keys
      .collect().toSet
  }
}
cseed
@cseed
Aug 18 2016 22:08
Ah, and BKM should implement copy, too.
Tim Poterba
@tpoterba
Aug 18 2016 22:08
yeah
cseed
@cseed
Aug 18 2016 22:09
That’s still in there? Amazing. That’s like from the first week working on Hail.
It isn’t used. Kill it (in it’s own PR!)
Daniel King
@danking
Aug 18 2016 22:11
For the random testing, high size, low count doesn't work out well for things that are naturally very small (e.g. Genotypes)
Tim Poterba
@tpoterba
Aug 18 2016 22:32
@cseed the zero value serialize/deserialize nonsense is fixed with lazy argument evaluation :thumbsup: