These are chat archives for nextflow-io/nextflow
@pditommaso That is one of the ADAM use cases: with data saved in Parquet format on S3, you can query via pushdown filter predicates
```scala
import org.bdgenomics.adam.rdd.ADAMContext._
import org.apache.parquet.filter2.dsl.Dsl._
import org.apache.parquet.filter2.predicate.FilterPredicate

val filtersPassed: FilterPredicate = (BooleanColumn("filtersPassed") === true)
val variants = sc.loadParquetVariants("s3a://bucket/1000g.variants.adam", Some(filtersPassed))
variants.rdd.count()
```
Pushdown filters on S3 using Spark SQL are somewhat experimental at the moment; we'll support them as the feature matures.
Queries on native data formats stored on S3 have to happen after the data have been retrieved; you can lessen the hit with indexes (e.g. bai, tbi) and region filters
```scala
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.models.ReferenceRegion

val region = ReferenceRegion.fromStart("1", 100000L)
val variants = sc.loadIndexedVcf("s3a://bucket/1000g.vcf.gz", region)
val filteredVariants = variants.transform(rdd => rdd.filter(_.filtersPassed))
filteredVariants.rdd.count()
```
Or, as one of our collaborators does, use ADAM to convert from native data formats to Parquet, and then load the Parquet into Amazon Athena for querying via SQL
```scala
import org.bdgenomics.adam.rdd.ADAMContext._

val alignments = sc.loadAlignmentRecords("s3a://bucket/NA12878.bam")
alignments.saveAsParquet("s3a://bucket/NA12878.alignments.adam")
```
For doing this from NF, it first helps to have a Spark cluster, which is itself a conversation we've been having for some time ;) Then, even though the examples above are all in Scala, most of our Scala APIs are callable from Java (and Groovy by extension); for those APIs that aren't, we've added separate Java APIs.
GenomicRDDs can be backed by RDDs or Datasets/DataFrames, depending on how you interact with them.
groupBy usage? I have my associative array, I'm just not sure how to use it afterward.
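A groupBy call returns a plain map from each key to the list of items that share it; afterward you just iterate the map's entries (or look keys up directly). A minimal sketch in Java using `Collectors.groupingBy`, the analogue of Groovy's `groupBy` — the filenames and key-extraction logic here are made-up assumptions, not from the original question:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupByDemo {
    // groupBy-style: build the associative array (key -> list of matching values)
    static Map<String, List<String>> byChromosome(List<String> files) {
        return files.stream()
                .collect(Collectors.groupingBy(f -> f.split("\\.")[1]));
    }

    public static void main(String[] args) {
        // hypothetical per-chromosome VCF shards
        List<String> files = List.of("a.chr1.vcf", "b.chr1.vcf", "c.chr2.vcf");
        Map<String, List<String>> grouped = byChromosome(files);
        // "afterward": iterate the entries of the grouped map
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().size());
        }
    }
}
```

The same pattern applies in Groovy: the result of `groupBy` is an ordinary map, so `each { key, values -> ... }` over it is all that's needed.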