These are chat archives for nextflow-io/nextflow

10th Jan 2018
Paolo Di Tommaso
@pditommaso
Jan 10 2018 12:29
this is interesting
imagine querying an S3-hosted dataset with SQL from a NF script
Alexander Peltzer
@apeltzer
Jan 10 2018 15:14
Indeed interesting stuff!
Maxime Garcia
@MaxUlysse
Jan 10 2018 15:20
Hi, just in case, I published my notes about how I tried to use AWS with CAW: https://maxulysse.github.io/2017/11/16/Running-CAW-with-AWS-Batch/
Venkat Malladi
@vsmalladi
Jan 10 2018 16:34
@DoaneAS your atac-seq pipeline looks great. Are you planning on adding IDR?
Michael L Heuer
@heuermh
Jan 10 2018 17:37

@pditommaso That is one of the ADAM use cases: with data saved in Parquet format on S3, you can query via pushdown filter predicates

import org.bdgenomics.adam.rdd.ADAMContext._
import org.apache.parquet.filter2.dsl.Dsl._
import org.apache.parquet.filter2.predicate.FilterPredicate

val filtersPassed: FilterPredicate = (BooleanColumn("filtersPassed") === true)
val variants = sc.loadParquetVariants("s3a://bucket/1000g.variants.adam", Some(filtersPassed))
variants.rdd.count()

Pushdown filtering on S3 using Spark SQL is somewhat experimental at the moment; we'll support it as it matures.

Queries on native data formats stored on S3 would have to happen after the data have been retrieved; you can lessen the hit with indexes (e.g. bai, tbi) and region filters

import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.models.ReferenceRegion

val region = ReferenceRegion.fromStart("1", 100000L)
val variants = sc.loadIndexedVcf("s3a://bucket/1000g.vcf.gz", region)
val filteredVariants = variants.transform(rdd => rdd.filter(_.filtersPassed))
filteredVariants.rdd.count()

Or as one of our collaborators does, use ADAM to convert from native data formats to Parquet, and then load Parquet into Amazon Athena for querying by SQL

import org.bdgenomics.adam.rdd.ADAMContext._

val alignments = sc.loadAlignmentRecords("s3a://bucket/NA12878.bam")
alignments.saveAsParquet("s3a://bucket/NA12878.alignments.adam")

To do this from NF, it first helps to have a Spark cluster, which is itself a conversation we've been having for some time ;) Then, even though the examples above are all in Scala, most of our Scala APIs are callable from Java (and Groovy by extension), and for those APIs that aren't, we've added separate Java APIs.

Paolo Di Tommaso
@pditommaso
Jan 10 2018 17:41
after I wrote that comment, I realised ADAM was somehow related
Michael L Heuer
@heuermh
Jan 10 2018 17:43
We will bring NF and ADAM together at some point :)
Paolo Di Tommaso
@pditommaso
Jan 10 2018 17:43
ahaha
the secret plan ;)
among other things, I was reading Frank's dissertation, and he has some critiques of the choice to depend on Parquet/Avro
are you planning to continue using it?
Michael L Heuer
@heuermh
Jan 10 2018 17:46
Yeah, we go back and forth transparently and lazily between Avro-generated Java classes in RDDs and Avro-generated Scala Product classes in Dataset/DataFrames for Spark SQL. Were we to start over from scratch now, we'd probably only use Spark SQL.
Paolo Di Tommaso
@pditommaso
Jan 10 2018 17:48
that means Spark RDD
Michael L Heuer
@heuermh
Jan 10 2018 17:51
incomplete thought?
Paolo Di Tommaso
@pditommaso
Jan 10 2018 17:52
no, I mean that Spark SQL is likely implemented on top of the RDD data structure
so I was guessing that an alternative to Parquet would be to use Spark RDDs directly
Michael L Heuer
@heuermh
Jan 10 2018 17:58
Oh, no, Datasets/DataFrames have a different implementation than RDDs, see e.g. https://spark.apache.org/docs/2.2.1/sql-programming-guide.html#datasets
We load from Parquet or native file formats to GenomicRDDs which can be backed by RDDs or Dataset/DataFrames, depending on how you interact with them.
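As a rough sketch of that duality (not from the original exchange, and assuming the 0.23-era API, where a GenomicRDD offers both transform over the underlying RDD and transformDataset over the Spark SQL Dataset view; the exact method and column names below may differ between releases):

import org.bdgenomics.adam.rdd.ADAMContext._

// load once; the same GenomicRDD can then be used through either view
val alignments = sc.loadAlignmentRecords("s3a://bucket/NA12878.alignments.adam")

// RDD view: Avro-generated Java classes
val mappedRdd = alignments.transform(rdd => rdd.filter(_.getReadMapped))

// Dataset/DataFrame view: Avro-generated Scala Product classes, converted lazily
// (transformDataset and the readMapped column name are assumptions here)
val mappedDs = alignments.transformDataset(ds => ds.filter("readMapped = true"))

mappedRdd.rdd.count()
mappedDs.rdd.count()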
Paolo Di Tommaso
@pditommaso
Jan 10 2018 17:59
I see, a lot to study about Spark internals
Michael L Heuer
@heuermh
Jan 10 2018 18:00
The hope is that our APIs make it easier to use!
Paolo Di Tommaso
@pditommaso
Jan 10 2018 18:00
so maybe I should start from the ADAM API ..
Michael L Heuer
@heuermh
Jan 10 2018 18:00
:thumbsup:
Paolo Di Tommaso
@pditommaso
Jan 10 2018 18:00
what is the status of the Java API?
does it provide the same features as the Scala one?
Michael L Heuer
@heuermh
Jan 10 2018 18:02
You can call most Scala APIs from Java without problems; the Java APIs help with Scala language things like method parameter defaults, Optionals, etc.
Our Python and R APIs still need work to catch up with the Scala & Java APIs
Paolo Di Tommaso
@pditommaso
Jan 10 2018 18:03
ok
Michael L Heuer
@heuermh
Jan 10 2018 18:04
Here are two examples, CountAlignments.scala and JavaCountAlignments.java
Paolo Di Tommaso
@pditommaso
Jan 10 2018 18:06
interesting
Félix C. Morency
@fmorency
Jan 10 2018 19:05
Is there a pipeline somewhere that shows groupBy usage? I have my associative array, I'm just not sure how to use it afterward.
Michael L Heuer
@heuermh
Jan 10 2018 19:31
@MaxUlysse regarding your earlier https://maxulysse.github.io/2017/11/15/Running-CAW-with-Singularity/, is there a resource that builds and hosts Singularity images built from Bioconda recipes?
Alexander Peltzer
@apeltzer
Jan 10 2018 20:10
@heuermh AFAIK no, but you can directly "import" pre-built docker images in Singularity e.g. singularity exec docker://docker-registry/image:releasetag or similar
Evan Floden
@evanfloden
Jan 10 2018 20:55
Adding to this, as of the Nextflow release yesterday, you no longer need docker:// before the container when you use Singularity with NF.