Michael L Heuer
@heuermh
> printSchema(exacDF)
root
 |-- contigName: string (nullable = true)
 |-- start: long (nullable = true)
 |-- end: long (nullable = true)
 |-- names: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- referenceAllele: string (nullable = true)
 |-- alternateAllele: string (nullable = true)
 |-- filtersApplied: boolean (nullable = true)
 |-- filtersPassed: boolean (nullable = true)
 |-- filtersFailed: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- annotation: struct (nullable = true)
 |    |-- ancestralAllele: string (nullable = true)
 |    |-- alleleCount: integer (nullable = true)
 |    |-- readDepth: integer (nullable = true)
 |    |-- forwardReadDepth: integer (nullable = true)
 |    |-- reverseReadDepth: integer (nullable = true)
 |    |-- referenceReadDepth: integer (nullable = true)
 |    |-- referenceForwardReadDepth: integer (nullable = true)
 |    |-- referenceReverseReadDepth: integer (nullable = true)
 |    |-- alleleFrequency: float (nullable = true)
 |    |-- cigar: string (nullable = true)
 |    |-- dbSnp: boolean (nullable = true)
 |    |-- hapMap2: boolean (nullable = true)
 |    |-- hapMap3: boolean (nullable = true)
 |    |-- validated: boolean (nullable = true)
 |    |-- thousandGenomes: boolean (nullable = true)
 |    |-- somatic: boolean (nullable = true)
 |    |-- transcriptEffects: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- alternateAllele: string (nullable = true)
 |    |    |    |-- effects: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- geneName: string (nullable = true)
 |    |    |    |-- geneId: string (nullable = true)
 |    |    |    |-- featureType: string (nullable = true)
 |    |    |    |-- featureId: string (nullable = true)
 |    |    |    |-- biotype: string (nullable = true)
 |    |    |    |-- rank: integer (nullable = true)
 |    |    |    |-- total: integer (nullable = true)
 |    |    |    |-- genomicHgvs: string (nullable = true)
 |    |    |    |-- transcriptHgvs: string (nullable = true)
 |    |    |    |-- proteinHgvs: string (nullable = true)
 |    |    |    |-- cdnaPosition: integer (nullable = true)
 |    |    |    |-- cdnaLength: integer (nullable = true)
 |    |    |    |-- cdsPosition: integer (nullable = true)
 |    |    |    |-- cdsLength: integer (nullable = true)
 |    |    |    |-- proteinPosition: integer (nullable = true)
 |    |    |    |-- proteinLength: integer (nullable = true)
 |    |    |    |-- distance: integer (nullable = true)
 |    |    |    |-- messages: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |-- attributes: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
shoumitrabala
@shoumitrabala
Hi guys, I am new to ADAM and want to explore this platform. Where can I start?
Michael L Heuer
@heuermh
@shoumitrabala Know homebrew or conda?
Anton Kulaga
@antonkulaga
Any estimates for the next release? The last one was on Apr 11, and I am forced to use the latest master because of some critical fixes there
Michael L Heuer
@heuermh
Yesterday, I wish!
Here is the 0.23.0 Milestone https://github.com/bigdatagenomics/adam/milestone/16
Unfortunately we open new issues and pull requests faster than we close them.
Anton Kulaga
@antonkulaga
It is a lot. Maybe you can move some issues to the next milestone and release what is there?
Michael L Heuer
@heuermh
Perhaps there is some additional triage we could do. Many have to do with the release itself; by my count 14 are around documentation and 8 are about packaging for python and R, both new to this release.
Ryan Williams
@ryan-williams
@antonkulaga are you using a -SNAPSHOT release in the meantime?
Anton Kulaga
@antonkulaga
I just build from source and publishLocal
If you think SNAPSHOT is better, then what should I add to my build.sbt?
Michael L Heuer
@heuermh

You need the sbt equivalent of this pom snippet

    <repositories>
        <repository>
            <id>sonatype-nexus-snapshots</id>
            <name>Sonatype Nexus Snapshots</name>
            <url>https://oss.sonatype.org/content/repositories/snapshots</url>
            <releases>
                <enabled>false</enabled>
            </releases>
            <snapshots>
                <enabled>true</enabled>
            </snapshots>
        </repository>
    </repositories>

or extend from org.sonatype.oss:oss-parent:${version} pom, if that makes sense in sbt

Anton Kulaga
@antonkulaga
I see only "utils" there
sorry, did not notice the scrollbar =)
I see that the latest for adam-core-spark2_2.11 is 0.22.0
looks like it is equal to the latest release, or I am searching the wrong way...
Michael L Heuer
@heuermh
That search interface only indexes the releases repository; the snapshots (built nightly by our CI Jenkins) are deployed to the separate Sonatype Nexus Snapshots repository.
Ryan Williams
@ryan-williams
resolvers += Resolver.sonatypeRepo("snapshots")
adds SNAPSHOT resolution to an SBT build
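Putting the two together, a minimal build.sbt sketch might look like this (the version string and the use of %% for the Scala 2.11 suffix are assumptions; verify the actual coordinates against the snapshots repository):

```scala
// build.sbt sketch; the snapshot version below is hypothetical
resolvers += Resolver.sonatypeRepo("snapshots")

// with scalaVersion := "2.11.x", %% resolves to the adam-core-spark2_2.11 artifact
libraryDependencies += "org.bdgenomics.adam" %% "adam-core-spark2" % "0.23.0-SNAPSHOT"
```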
Daniel Almeida
@daniat87
Hi guys, could you give me some guidelines to start working on the ADAM project? I would like to do some Scala/Spark development. How do you test your changes, locally in pseudo-distributed mode or remotely against a cluster? I appreciate your help, thanks!
Michael L Heuer
@heuermh
@daniat87 All of the above :)
I personally test things locally, rely on our Jenkins build (scripts/jenkins_test) for cross-Spark and cross-Scala checks, and then do Real Work on AWS using cgcloud https://github.com/BD2KGenomics/cgcloud, Toil http://toil.readthedocs.io/en/latest/install/cloud.html#toil-provisioner, or EMR https://aws.amazon.com/emr/
Michael L Heuer
@heuermh

@daniat87 @shoumitrabala To get started with the development build (git HEAD), first install Apache Spark and Apache Maven

$ brew install maven apache-spark

Then clone or fork+clone the ADAM repository, move to Spark 2.x/Scala 2.11 to match the Spark version, build with Maven, and run bin/adam-submit or bin/adam-shell

$ git clone https://github.com/bigdatagenomics/adam.git
$ cd adam
$ ./scripts/move_to_spark_2.sh
$ ./scripts/move_to_scala_2.11.sh
$ mvn clean install
$ ./bin/adam-shell
There are a few issues marked with the "pick me up!" label if you want something to work on
Daniel Almeida
@daniat87
Thanks a lot @heuermh
Michael L Heuer
@heuermh
@tangxuan_twitter Do you have a VCF file with -Inf or +Inf values in it that you could excerpt from for a unit test case? See bigdatagenomics/adam#1721
Ryan Williams
@ryan-williams
is the weekly call happening?
Devin Petersohn
@devin-petersohn
I don't have a call link, but I would like to have the call if there is one.
Hey guys, I am having a build issue I can't seem to resolve. I am trying to build a downstream application that depends on ADAM on Spark 2. Is anyone familiar with a fix for this error?
[ERROR] error: missing or invalid dependency detected while loading class file 'GenomicDataset.class'.
[INFO] Could not access term package in package org.apache.spark.sql,
[INFO] because it (or its dependencies) are missing. Check your build definition for
[INFO] missing or conflicting dependencies.
I tried with Spark 2.1 and 2.2, and building ADAM against both.
Devin Petersohn
@devin-petersohn
Clarification: This does not happen when I build ADAM, rather when I build the app that depends on ADAM.
Ryan Williams
@ryan-williams
not a deep insight but i'd look at mvn dependency:tree and try to see where/whether conflicting versions of GenomicDataset or Spark might be creeping in?
Michael L Heuer
@heuermh
@devin-petersohn You may need to add a direct dependency on spark-sql, provided scope. I found I needed to do this in downstream apps.
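A hedged sketch of that pom addition (the artifact suffix and version are assumptions; match them to your Spark and Scala build):

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.2.0</version>
    <scope>provided</scope>
</dependency>
```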
we need a new meeting notice, I'll create one starting next week
Devin Petersohn
@devin-petersohn
@heuermh Fixed the issue. What happened was I had spark-sql in the pom.xml for the parent project, but not the core/cli pom.xml files. Thanks for the help.
Pierre Lindenbaum
@lindenb
Hi all, I'm very new to Spark and ADAM, but I'm a regular user of the htsjdk library. I'm trying to understand how ADAM works ('parquet', etc...), and how I could use Spark to create an RDD<VariantContext>. My main problem is that you often need the VCFHeader to decode the semantics of the INFO column (e.g. VEP annotations), and thus how to bind a VCFHeader to an RDD. As far as I understand, ADAM has combined the header and the variant in the same object (am I wrong?). Isn't it too big? Furthermore, what would happen if there is no variant (== no header) and I want to write a new valid VCF at the end? So many questions...
Michael L Heuer
@heuermh

Hello @lindenb! RDD<VariantContext> isn't a very useful abstraction for users of ADAM; rather, you should prefer VariantRDD or GenotypeRDD.

http://bdgenomics.org/adam/latest/scaladocs/index.html#org.bdgenomics.adam.rdd.variant.VariantRDD
http://bdgenomics.org/adam/latest/scaladocs/index.html#org.bdgenomics.adam.rdd.variant.GenotypeRDD

We handle the metadata (VCF header, samples, and sequence dictionary), and VCF INFO column decoding/encoding. The doc in the Avro schema details the mapping, which has one additional complication, in that we split multi-allelic variant rows in VCF into separate Variant records.

VEP annotations (since you mentioned them) are mapped to an array of TranscriptEffects.
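As a rough illustration with made-up data, the multi-allelic split works like this (ADAM stores 0-based, half-open coordinates, so VCF position 100 becomes start 99; the record shape below is a sketch, not the exact Avro constructor):

```
# hypothetical multi-allelic VCF row (1-based position):
chr1    100    .    A    C,T    ...

# split into one Variant record per alternate allele (0-based, half-open):
Variant(contigName = "chr1", start = 99, end = 100, referenceAllele = "A", alternateAllele = "C")
Variant(contigName = "chr1", start = 99, end = 100, referenceAllele = "A", alternateAllele = "T")
```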
Michael L Heuer
@heuermh

A good way to learn the ADAM APIs would be to start a shell with adam-shell and load up a VCF file

import org.bdgenomics.adam.rdd.ADAMContext._
val variants = sc.loadVariants("my-sites-only.vcf")
val variant = variants.rdd.first()
val genotypes = sc.loadGenotypes("my-genotypes.vcf")
val genotype = genotypes.rdd.first()

then you can tab-complete the various variables and see what fields and methods are available

Pierre Lindenbaum
@lindenb
@heuermh thanks for the links !, I'll have a look ! :-)
Jerry Liu
@jerryliu2005
Hello - Where can I find docs for the Adam Python API? Thx!
Michael L Heuer
@heuermh
@jerryliu2005 I believe we'll push to PyPI at the next release; for the time being, let the Source be your guide
Jerry Liu
@jerryliu2005
@heuermh Thanks, good to know!
Jerry Liu
@jerryliu2005
Hello - where can I find info on using Adam to load multiple individual-sample gvcfs, merge, and then output a multi-sample VCF? Thx!
Devin Petersohn
@devin-petersohn
@jerryliu2005 Are you using the Python API to union those files?
Michael L Heuer
@heuermh
@jerryliu2005 You can use globs in loadGenotypes which unions all of the Genotype records read, e.g. sc.loadGenotypes("/data/**.vcf") That is not the same as merging to a multi-sample gVCF though, see bigdatagenomics/adam#1312
Jerry Liu
@jerryliu2005
@devin-petersohn I'm using Scala at this point. I'm still learning the ropes of ADAM here.
Jerry Liu
@jerryliu2005
@heuermh By union, do you mean concatenating multiple VCFs from the same sample, like from different chromosomes? bigdatagenomics/adam#1312 asks the same question I was asking, but only points to a link to Frank's joint-calling implementation in avocado. At this point I'm not looking to re-call genomes, just to merge (add) a new gVCF into a previously merged multi-sample gVCF. I am wondering if ADAM has an easy way to do this.
pkothiyal
@pkothiyal
Hi all, I have another question on merging gVCFs. I ran a small test using loadGenotypes on a few gVCFs, saving each resultant GenotypeRDD as Parquet, and then reading in all Parquet files as a SQL context so I can try out some queries. As expected, only sites that are variant in at least one of the samples are saved during the conversion. Is there a workaround to back-fill hom-ref calls across the samples for sites that are variant in the union RDD? We are just trying to figure out a way to avoid merging thousands of gVCFs before being able to use Adam. Thanks.
Michael L Heuer
@heuermh
@pkothiyal Ref-only sites are not lost when reading from gVCF or writing to Parquet, at least not that I can see. With one of our test resources in adam-shell
scala> import org.bdgenomics.adam.rdd.ADAMContext._

scala> val genotypes = sc.loadGenotypes("adam-core/src/test/resources/gvcf_dir/gvcf_multiallelic.g.vcf")

scala> genotypes.rdd.count()
res0: Long = 6

scala> genotypes.rdd.foreach(g => println(Array(g.contigName, g.start, g.end, g.variant.referenceAllele, g.variant.alternateAllele, g.alleles).mkString("\t")))
chr22    16157520    16157602    C    null    [REF, REF]
chr22    16157602    16157603    G    C    [ALT, ALT]
chr22    16157603    16157639    G    null    [REF, REF]
chr22    18030095    18030099    TAAA    T    [OTHER_ALT, OTHER_ALT]
chr22    18030095    18030099    TAAA    TA    [ALT, OTHER_ALT]
chr22    18030095    18030099    TAAA    TAA    [OTHER_ALT, ALT]

scala> genotypes.saveAsParquet("gvcf.genotypes.adam")

scala> val parquetGenotypes = sc.loadGenotypes("gvcf.genotypes.adam")

scala> parquetGenotypes.rdd.count()
res8: Long = 6

scala> parquetGenotypes.rdd.foreach(g => println(Array(g.contigName, g.start, g.end, g.variant.referenceAllele, g.variant.alternateAllele, g.alleles).mkString("\t")))
chr22    16157520    16157602    C    null    [REF, REF]
chr22    16157602    16157603    G    C    [ALT, ALT]
chr22    16157603    16157639    G    null    [REF, REF]
chr22    18030095    18030099    TAAA    T    [OTHER_ALT, OTHER_ALT]
chr22    18030095    18030099    TAAA    TA    [ALT, OTHER_ALT]
chr22    18030095    18030099    TAAA    TAA    [OTHER_ALT, ALT]
Michael L Heuer
@heuermh
When you union gVCF genotypes together, nothing clever happens with ref-only sites
scala> val edit = sc.loadGenotypes("adam-core/src/test/resources/gvcf_dir/gvcf_multiallelic.g.edit.vcf")

scala> edit.rdd.foreach(g => println(Array(g.contigName, g.start, g.end, g.variant.referenceAllele, g.variant.alternateAllele, g.alleles).mkString("\t")))
chr22    16157520    16157639    C    null    [REF, REF]
chr22    18030095    18030099    TAAA    T    [OTHER_ALT, OTHER_ALT]
chr22    18030095    18030099    TAAA    TA    [ALT, OTHER_ALT]
chr22    18030095    18030099    TAAA    TAA    [OTHER_ALT, ALT]

scala> edit.saveAsParquet("edit.genotypes.adam")

scala> val union = sc.loadGenotypes("*.genotypes.adam/*")

scala> union.rdd.count()
res12: Long = 10

scala> union.rdd.foreach(g => println(Array(g.contigName, g.start, g.end, g.variant.referenceAllele, g.variant.alternateAllele, g.alleles, g.sampleId).mkString("\t")))
chr22    16157520    16157602    C    null    [REF, REF]    NA12878i
chr22    16157602    16157603    G    C    [ALT, ALT]    NA12878i
chr22    16157603    16157639    G    null    [REF, REF]    NA12878i
chr22    18030095    18030099    TAAA    T    [OTHER_ALT, OTHER_ALT]    NA12878i
chr22    18030095    18030099    TAAA    TA    [ALT, OTHER_ALT]    NA12878i
chr22    18030095    18030099    TAAA    TAA    [OTHER_ALT, ALT]    NA12878i
chr22    16157520    16157639    C    null    [REF, REF]    EDIT
chr22    18030095    18030099    TAAA    T    [OTHER_ALT, OTHER_ALT]    EDIT
chr22    18030095    18030099    TAAA    TA    [ALT, OTHER_ALT]    EDIT
chr22    18030095    18030099    TAAA    TAA    [OTHER_ALT, ALT]    EDIT