A genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed.
When I use this code to generate a SAM file:
val fragments = sc.loadFastq("36_11.fastq", Option("36_22.fastq"))
fragments.saveAsSam("test.sam", SAMFormat.SAM, true, false)
I got:
@HD VN:1.6 SO:unsorted
SRR8418436.2 2 77 0 0 0 0 GGAAATTTAAAAAAATACACATGGCCAGGCCCCAGCCCAAATCACTAATAAGAATCTCCAGGGCTTCACCTGTTAGACTGGCAAAAATCCAAAAGTAAACA @@@bda:?FHHHDHGFHBF9A?EGE@FHGBG@DHGE;;B2=@GIDGDCFHHIBGHIIEEFC8?@b(.;CA>>>;ACC<<?CCB9?9::>:>@C:@@cc>
SRR8418436.2 2 141 0 0 0 0 CACCAGCAATGTGTAGGAATACCTGTTTCTCCACAAAGTGTTTACTTTTGGATTTTTGCCAGTCTAACAGGTGAAGCCCTGGAGATTCTTATTAGTGATTT ?@BDDABCCBFDHA9F@E@FHDEBEHI@DEBDAHIII1:D@GB@FHIB?@??DGDHGCGB==C>;F@@C=C;??E=CBB?;ACACA35@@
but when I use another tool (for example, Picard) I got:
@HD VN:1.6 SO:queryname
@RG ID:A SM:1
SRR8418436.2 77 0 0 0 0 GGAAATTTAAAAAAATACACATGGCCAGGCCCCAGCCCAAATCACTAATAAGAATCTCCAGGGCTTCACCTGTTAGACTGGCAAAAATCCAAAAGTAAACA @@@bda:?FHHHDHGFHBF9A?EGE@FHGBG@DHGE;;B2=@GIDGDCFHHIBGHIIEEFC8?@b(.;CA>>>;ACC<<?CCB9?9::>:>@C:@@cc> RG:Z:A
SRR8418436.2 141 0 0 0 0 CACCAGCAATGTGTAGGAATACCTGTTTCTCCACAAAGTGTTTACTTTTGGATTTTTGCCAGTCTAACAGGTGAAGCCCTGGAGATTCTTATTAGTGATTT ?@BDDABCCBFDHA9F@E@FHDEBEHI@DEBDAHIII1:D@GB@FHIB?@??DGDHGCGB==C>;F@@C=C;??E=CBB?;ACACA35@@ RG:Z:A
I think there is an extra column:
SRR8418436.2 2
---------------^
I want to know how to omit this column.
If they are Alignments, then you could use loadAlignments(DataFrame, ...).
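A minimal sketch of one way to do that, assuming the extra column comes from whitespace kept in the FASTQ read names (an assumption, not a confirmed diagnosis); it trims the names with the dataset's generic transform method before writing SAM, rather than round-tripping through loadAlignments(DataFrame, ...):

import org.bdgenomics.adam.ds.ADAMContext._
import org.bdgenomics.formats.avro.Alignment
import org.seqdoop.hadoop_bam.SAMFormat

val fragments = sc.loadFastq("36_11.fastq", Option("36_22.fastq"))
// keep only the first whitespace-delimited token of each read name
val trimmed = fragments.transform(rdd => rdd.map(a =>
  Alignment.newBuilder(a).setReadName(a.getReadName.split("\\s+")(0)).build()))
trimmed.saveAsSam("test.sam", SAMFormat.SAM, true, false)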
import org.bdgenomics.adam.ds.ADAMContext._
import org.bdgenomics.adam.models.{ ReferenceFile, ReferenceRegion }

// load the reference (fragments of at most 10,000 bases) and the features
val ref = sc.loadReferenceFile("reference.fa", 10000L)
val features = sc.loadFeatures("features.bed", Some(ref.references))
// broadcast the reference so each task can extract sequence locally
val broadcastRef = sc.broadcast(ref)
// extract the sequence under each feature, either ignoring strand...
val sequences = features.rdd.map(f => (f, broadcastRef.value.extract(ReferenceRegion.unstranded(f))))
// ...or respecting strand
val sequences = features.rdd.map(f => (f, broadcastRef.value.extract(ReferenceRegion.stranded(f))))
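For a quick sanity check (purely illustrative), one can peek at a few of the extracted (feature, sequence) pairs:

// feature name may be null for minimal BED records
sequences.take(5).foreach { case (f, s) =>
  println(s"${f.getName}\t${s.length}\t${s.take(60)}")
}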
In adam-shell:
val alignments = ...
val df = alignments.toDF
df.createOrReplaceTempView("alignments")
spark.sql("select count(*) from alignments").show()
spark.sql("select count(*) from alignments where mappingQuality > 40").show()
spark.sql("select count(*) from alignments where readMapped = true").show()
SELECT pos, ref, alt... FROM alignments OUTER JOIN reference ON alignments.reference = reference.id
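A runnable version of that sketch in adam-shell, assuming a small hand-built reference lookup table (the id and name columns are invented for illustration and are not part of any bdg-formats schema):

import spark.implicits._

// hypothetical lookup table standing in for a real reference table
val reference = Seq((0, "chr1"), (1, "chr2")).toDF("id", "name")
reference.createOrReplaceTempView("reference")

spark.sql("""
  SELECT a.readName, a.start, r.id
  FROM alignments a
  LEFT OUTER JOIN reference r ON a.referenceName = r.name
""").show()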
Would we expect those ratios to hold even for terabyte or petabyte-scale datasets? My understanding of how Parquet works is that it's good at exactly that: compressing data points that are repeated.
Yes, exactly. You can run parquet-tools meta on any dataset written out in Parquet format by ADAM and it will provide column-wise compression statistics.
$ parquet-tools meta NA12878.alignedHg38.duplicateMarked.baseRealigned.alignments.adam
...
row group 1: RC:61614 TS:55357462 OFFSET:4
--------------------------------------------------------------------------------
referenceName: BINARY GZIP DO:4 FPO:942 SZ:1729/6586/3.81 VC:61614 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: chr1, max: chrY, num_nulls: 442]
start: INT64 GZIP DO:0 FPO:1733 SZ:184263/489524/2.66 VC:61614 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 0, max: 248940527, num_nulls: 442]
originalStart: INT64 GZIP DO:0 FPO:185996 SZ:210/130/0.62 VC:61614 ENC:PLAIN,BIT_PACKED,RLE ST:[num_nulls: 61614, min/max not defined]
end: INT64 GZIP DO:0 FPO:186206 SZ:185273/489523/2.64 VC:61614 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 57, max: 248940776, num_nulls: 442]
mappingQuality: INT32 GZIP DO:371479 FPO:371610 SZ:9210/17304/1.88 VC:61614 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: 0, max: 60, num_nulls: 442]
readName: BINARY GZIP DO:380689 FPO:496636 SZ:277806/1647133/5.93 VC:61614 ENC:PLAIN,PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: H06HDADXX130110:1:1101:10000:100394, max: H06JUADXX130110:1:1101:10089:47691, num_nulls: 0]
...
How should I think about ADAM and Disq (or Hadoop-BAM or Spark-BAM)? Is it correct to say that these libraries are responsible for reading FASTQ/BAM/CRAM/VCF data into Spark RDDs, but not for converting into a Spark Dataset or writing to Parquet (and thus do not currently use or overlap with the bdg-formats schemas)?
From the benchmarks and comments earlier in this channel, it seems like Disq is what I want to use to deal with these genomics files, but I want to confirm that I'm thinking of the responsibilities of the two correctly and that Disq is indeed the right horse to back adoption- and roadmap-wise.
The GenomicDataset abstraction can realize in RAM either Spark RDDs of Java objects defined by the bdg-formats schemas or Spark Datasets of Scala case classes also defined by the bdg-formats schemas.
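For instance, in adam-shell (the BAM path here is a placeholder), the two in-memory views of the same data look like this:

import org.bdgenomics.adam.ds.ADAMContext._

val alignments = sc.loadAlignments("sample.bam")
val rdd = alignments.rdd         // RDD of Avro objects: org.bdgenomics.formats.avro.Alignment
val ds  = alignments.dataset     // Dataset of case classes: org.bdgenomics.adam.sql.Alignment
val df  = alignments.toDF        // untyped DataFrame over the same schema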
ADAM converts htsjdk objects (e.g. SAMRecord) into bdg-formats objects (e.g. Alignment) and back. Disq also works with htsjdk Java objects, so the conversion step is still necessary. ADAM reads or writes FASTA, FASTQ, BED6/12, GFF3, GTF/GFF2, NarrowPeak, and IntervalList formats directly. For GenBank there is a biojava-adam library.
Disq gives you a JavaRDD<SAMRecord>. If you want to use RDD[Alignment], Dataset[Alignment], or a DataFrame from Scala, Java, Python, or R, and/or read from or write to Parquet in the bdg-formats schemas on disk, then you'll need ADAM.
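A minimal end-to-end sketch of that, with placeholder file names: read a BAM through ADAM (which does the htsjdk-to-bdg-formats conversion), write it as Parquet in the Alignment schema, and read it back.

import org.bdgenomics.adam.ds.ADAMContext._

val fromBam = sc.loadAlignments("sample.bam")       // htsjdk records converted to Alignment
fromBam.saveAsParquet("sample.alignments.adam")     // Parquet in the bdg-formats schema
val fromParquet = sc.loadAlignments("sample.alignments.adam")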
AlignmentDataset.recalibrateBaseQualities. Method doc:
We consider a read valid for use in generating the recalibration table if it is
a "proper" read. That is to say, it is the canonical read (mapped as a primary
alignment with non-zero mapQ and not marked as a duplicate), it did not fail
vendor checks, and the quality and CIGAR are defined.
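Purely as an illustration (this is not the recalibration implementation), those criteria can be approximated as a DataFrame filter to inspect which reads would qualify; column names follow the bdg-formats Alignment schema, the quality-scores check is omitted, and alignments is an AlignmentDataset as in the earlier snippets:

import spark.implicits._

val df = alignments.toDF
val proper = df.filter(
  $"readMapped" && $"primaryAlignment" && $"mappingQuality" > 0 &&
  !$"duplicateRead" && !$"failedVendorQualityChecks" &&
  $"cigar".isNotNull)
proper.count()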
ObservationTable (if that is what you are looking for) is defined here; see its toCSV method.
bcftools or vt