A genomics processing engine and specialized file format built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
> printSchema(exacDF)
root
|-- contigName: string (nullable = true)
|-- start: long (nullable = true)
|-- end: long (nullable = true)
|-- names: array (nullable = true)
| |-- element: string (containsNull = true)
|-- referenceAllele: string (nullable = true)
|-- alternateAllele: string (nullable = true)
|-- filtersApplied: boolean (nullable = true)
|-- filtersPassed: boolean (nullable = true)
|-- filtersFailed: array (nullable = true)
| |-- element: string (containsNull = true)
|-- annotation: struct (nullable = true)
| |-- ancestralAllele: string (nullable = true)
| |-- alleleCount: integer (nullable = true)
| |-- readDepth: integer (nullable = true)
| |-- forwardReadDepth: integer (nullable = true)
| |-- reverseReadDepth: integer (nullable = true)
| |-- referenceReadDepth: integer (nullable = true)
| |-- referenceForwardReadDepth: integer (nullable = true)
| |-- referenceReverseReadDepth: integer (nullable = true)
| |-- alleleFrequency: float (nullable = true)
| |-- cigar: string (nullable = true)
| |-- dbSnp: boolean (nullable = true)
| |-- hapMap2: boolean (nullable = true)
| |-- hapMap3: boolean (nullable = true)
| |-- validated: boolean (nullable = true)
| |-- thousandGenomes: boolean (nullable = true)
| |-- somatic: boolean (nullable = true)
| |-- transcriptEffects: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- alternateAllele: string (nullable = true)
| | | |-- effects: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- geneName: string (nullable = true)
| | | |-- geneId: string (nullable = true)
| | | |-- featureType: string (nullable = true)
| | | |-- featureId: string (nullable = true)
| | | |-- biotype: string (nullable = true)
| | | |-- rank: integer (nullable = true)
| | | |-- total: integer (nullable = true)
| | | |-- genomicHgvs: string (nullable = true)
| | | |-- transcriptHgvs: string (nullable = true)
| | | |-- proteinHgvs: string (nullable = true)
| | | |-- cdnaPosition: integer (nullable = true)
| | | |-- cdnaLength: integer (nullable = true)
| | | |-- cdsPosition: integer (nullable = true)
| | | |-- cdsLength: integer (nullable = true)
| | | |-- proteinPosition: integer (nullable = true)
| | | |-- proteinLength: integer (nullable = true)
| | | |-- distance: integer (nullable = true)
| | | |-- messages: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| |-- attributes: map (nullable = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
You need the sbt equivalent of this pom snippet:
<repositories>
  <repository>
    <id>sonatype-nexus-snapshots</id>
    <name>Sonatype Nexus Snapshots</name>
    <url>https://oss.sonatype.org/content/repositories/snapshots</url>
    <releases>
      <enabled>false</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>
or extend from org.sonatype.oss:oss-parent:${version} pom, if that makes sense in sbt
resolvers += Resolver.sonatypeRepo("snapshots")
adds SNAPSHOT resolution to an sbt build
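As a fuller sketch, that resolver line sits alongside the dependency in build.sbt. The artifact coordinates and versions below are illustrative placeholders, so check the current ADAM release before copying:

```scala
// build.sbt (sketch): resolve Sonatype snapshots and depend on an
// ADAM snapshot build. Versions here are illustrative, not current.
scalaVersion := "2.11.12"

resolvers += Resolver.sonatypeRepo("snapshots")

libraryDependencies += "org.bdgenomics.adam" %% "adam-core-spark2" % "0.23.0-SNAPSHOT"
```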
Run scripts/jenkins_test for cross-Spark and cross-Scala checks, and then do Real Work on AWS using cgcloud (https://github.com/BD2KGenomics/cgcloud), Toil (http://toil.readthedocs.io/en/latest/install/cloud.html#toil-provisioner), or EMR (https://aws.amazon.com/emr/)
@daniat87 @shoumitrabala To get started with the development build (git HEAD), first install Apache Spark and Apache Maven
$ brew install maven apache-spark
Then clone or fork+clone the ADAM repository, move to Spark 2.x/Scala 2.11 to match the Spark version, build with Maven, and run bin/adam-submit
or bin/adam-shell
$ git clone https://github.com/bigdatagenomics/adam.git
$ cd adam
$ ./scripts/move_to_spark_2.sh
$ ./scripts/move_to_scala_2.11.sh
$ mvn clean install
$ ./bin/adam-shell
[ERROR] error: missing or invalid dependency detected while loading class file 'GenomicDataset.class'.
[INFO] Could not access term package in package org.apache.spark.sql,
[INFO] because it (or its dependencies) are missing. Check your build definition for
[INFO] missing or conflicting dependencies.
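That error usually means Spark SQL is not on the compile classpath of the downstream project, since GenomicDataset compiles against org.apache.spark.sql. A hedged guess at the fix for an sbt build (the version is a placeholder; match it to your installed Spark):

```scala
// build.sbt fragment (sketch): make spark-sql available at compile
// time so classes referencing org.apache.spark.sql can be loaded.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.1" % "provided"
```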
My main problem with RDD<VariantContext> is that you often need the VCFHeader to decode the semantics of the INFO column (e.g. VEP annotations), and thus how to bind a VCFHeader to an RDD. As far as I understand, ADAM has combined the header and the variant in the same object (am I wrong?). Isn't it too big? Furthermore, what would happen if there is no variant (== no header) and I want to write a new valid VCF at the end? So many questions...
Hello @lindenb! RDD<VariantContext> isn't an abstraction that's very useful to users of ADAM; rather, you should prefer VariantRDD or GenotypeRDD.
http://bdgenomics.org/adam/latest/scaladocs/index.html#org.bdgenomics.adam.rdd.variant.VariantRDD
http://bdgenomics.org/adam/latest/scaladocs/index.html#org.bdgenomics.adam.rdd.variant.GenotypeRDD
We handle the metadata (VCF header, samples, and sequence dictionary) and VCF INFO column decoding/encoding. The doc in the Avro schema details the mapping, which has one additional complication: we split multi-allelic variant rows in VCF into separate Variant records and TranscriptEffects.
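The splitting idea can be sketched in plain Scala. These case classes are illustrative stand-ins, not ADAM's Avro-generated records:

```scala
// Sketch only: stand-in types, not ADAM's actual Avro classes.
case class VcfRow(contigName: String, start: Long,
                  referenceAllele: String, alternateAlleles: Seq[String])
case class Variant(contigName: String, start: Long,
                   referenceAllele: String, alternateAllele: String)

// Each alternate allele of a multi-allelic row becomes its own record.
def splitMultiAllelic(row: VcfRow): Seq[Variant] =
  row.alternateAlleles.map { alt =>
    Variant(row.contigName, row.start, row.referenceAllele, alt)
  }

// One row with three alternate alleles yields three Variant records.
val row = VcfRow("chr22", 18030095L, "TAAA", Seq("T", "TA", "TAA"))
val variants = splitMultiAllelic(row)
```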
A good way to learn the ADAM APIs is to start a shell with adam-shell and load up a VCF file:
import org.bdgenomics.adam.rdd.ADAMContext._
val variants = sc.loadVariants("my-sites-only.vcf")
val variant = variants.rdd.first()
val genotypes = sc.loadGenotypes("my-genotypes.vcf")
val genotype = genotypes.rdd.first()
then you can tab-complete the various variables and see what fields and methods are available
$ ./bin/adam-shell
scala> import org.bdgenomics.adam.rdd.ADAMContext._
scala> val genotypes = sc.loadGenotypes("adam-core/src/test/resources/gvcf_dir/gvcf_multiallelic.g.vcf")
scala> genotypes.rdd.count()
res0: Long = 6
scala> genotypes.rdd.foreach(g => println(Array(g.contigName, g.start, g.end, g.variant.referenceAllele, g.variant.alternateAllele, g.alleles).mkString("\t")))
chr22 16157520 16157602 C null [REF, REF]
chr22 16157602 16157603 G C [ALT, ALT]
chr22 16157603 16157639 G null [REF, REF]
chr22 18030095 18030099 TAAA T [OTHER_ALT, OTHER_ALT]
chr22 18030095 18030099 TAAA TA [ALT, OTHER_ALT]
chr22 18030095 18030099 TAAA TAA [OTHER_ALT, ALT]
scala> genotypes.saveAsParquet("gvcf.genotypes.adam")
scala> val parquetGenotypes = sc.loadGenotypes("gvcf.genotypes.adam")
scala> parquetGenotypes.rdd.count()
res8: Long = 6
scala> parquetGenotypes.rdd.foreach(g => println(Array(g.contigName, g.start, g.end, g.variant.referenceAllele, g.variant.alternateAllele, g.alleles).mkString("\t")))
chr22 16157520 16157602 C null [REF, REF]
chr22 16157602 16157603 G C [ALT, ALT]
chr22 16157603 16157639 G null [REF, REF]
chr22 18030095 18030099 TAAA T [OTHER_ALT, OTHER_ALT]
chr22 18030095 18030099 TAAA TA [ALT, OTHER_ALT]
chr22 18030095 18030099 TAAA TAA [OTHER_ALT, ALT]
scala> val edit = sc.loadGenotypes("adam-core/src/test/resources/gvcf_dir/gvcf_multiallelic.g.edit.vcf")
scala> edit.rdd.foreach(g => println(Array(g.contigName, g.start, g.end, g.variant.referenceAllele, g.variant.alternateAllele, g.alleles).mkString("\t")))
chr22 16157520 16157639 C null [REF, REF]
chr22 18030095 18030099 TAAA T [OTHER_ALT, OTHER_ALT]
chr22 18030095 18030099 TAAA TA [ALT, OTHER_ALT]
chr22 18030095 18030099 TAAA TAA [OTHER_ALT, ALT]
scala> edit.saveAsParquet("edit.genotypes.adam")
scala> val union = sc.loadGenotypes("*.genotypes.adam/*")
scala> union.rdd.count()
res12: Long = 10
scala> union.rdd.foreach(g => println(Array(g.contigName, g.start, g.end, g.variant.referenceAllele, g.variant.alternateAllele, g.alleles, g.sampleId).mkString("\t")))
chr22 16157520 16157602 C null [REF, REF] NA12878i
chr22 16157602 16157603 G C [ALT, ALT] NA12878i
chr22 16157603 16157639 G null [REF, REF] NA12878i
chr22 18030095 18030099 TAAA T [OTHER_ALT, OTHER_ALT] NA12878i
chr22 18030095 18030099 TAAA TA [ALT, OTHER_ALT] NA12878i
chr22 18030095 18030099 TAAA TAA [OTHER_ALT, ALT] NA12878i
chr22 16157520 16157639 C null [REF, REF] EDIT
chr22 18030095 18030099 TAAA T [OTHER_ALT, OTHER_ALT] EDIT
chr22 18030095 18030099 TAAA TA [ALT, OTHER_ALT] EDIT
chr22 18030095 18030099 TAAA TAA [OTHER_ALT, ALT] EDIT
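The REF/ALT/OTHER_ALT values in the output above come from labeling each called allele relative to the alternate allele carried by that split record. A rough standalone sketch of that idea (hypothetical, not ADAM's actual implementation):

```scala
// Hypothetical labeling rule: allele index 0 is the reference; the
// index matching this record's own alternate allele is ALT; any other
// alternate allele is OTHER_ALT. Sketch only, not ADAM's code.
sealed trait AlleleLabel
case object REF extends AlleleLabel
case object ALT extends AlleleLabel
case object OTHER_ALT extends AlleleLabel

def label(calledAlleleIndex: Int, recordAltIndex: Int): AlleleLabel =
  if (calledAlleleIndex == 0) REF
  else if (calledAlleleIndex == recordAltIndex) ALT
  else OTHER_ALT

// A 2/3 genotype, viewed from the split record carrying alt #3 (TAA):
val labels = Seq(2, 3).map(label(_, 3))
```

Under this rule the same 2/3 call appears as [ALT, OTHER_ALT] on the record for alt #2 and [OTHER_ALT, ALT] on the record for alt #3, matching the TA and TAA rows above.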