Michael L Heuer
@heuermh
FYI, I've created a new module for using Disq to load BAM, CRAM, and VCF
https://github.com/heuermh/disq-adam
I'm not sure yet where it should best live: in the adam repository, in another repo under the bigdatagenomics organization, or in another repo under the disq-bio organization.
hxdhan
@hxdhan

When I use this code to generate a SAM file:

val fragments = sc.loadFastq("36_11.fastq", Option("36_22.fastq"))
fragments.saveAsSam("test.sam", SAMFormat.SAM, true, false)
I get:

@HD VN:1.6 SO:unsorted
SRR8418436.2 2 77 0 0 0 0 GGAAATTTAAAAAAATACACATGGCCAGGCCCCAGCCCAAATCACTAATAAGAATCTCCAGGGCTTCACCTGTTAGACTGGCAAAAATCCAAAAGTAAACA @@@bda:?FHHHDHGFHBF9A?EGE@FHGBG@DHGE;;B2=@GIDGDCFHHIBGHIIEEFC8?@b(.;CA>>>;ACC<<?CCB9?9::>:>@C:@@cc>
SRR8418436.2 2 141 0 0 0 0 CACCAGCAATGTGTAGGAATACCTGTTTCTCCACAAAGTGTTTACTTTTGGATTTTTGCCAGTCTAACAGGTGAAGCCCTGGAGATTCTTATTAGTGATTT ?@BDDABCCBFDHA9F@E@FHDEBEHI@DEBDAHIII1:D@GB@FHIB?@??DGDHGCGB==C>;F@@C=C;??E=CBB?;ACACA35@@

But when I use another tool (e.g. Picard) I get:
@HD VN:1.6 SO:queryname
@RG ID:A SM:1
SRR8418436.2 77 0 0 0 0 GGAAATTTAAAAAAATACACATGGCCAGGCCCCAGCCCAAATCACTAATAAGAATCTCCAGGGCTTCACCTGTTAGACTGGCAAAAATCCAAAAGTAAACA @@@bda:?FHHHDHGFHBF9A?EGE@FHGBG@DHGE;;B2=@GIDGDCFHHIBGHIIEEFC8?@b(.;CA>>>;ACC<<?CCB9?9::>:>@C:@@cc> RG:Z:A
SRR8418436.2 141 0 0 0 0 CACCAGCAATGTGTAGGAATACCTGTTTCTCCACAAAGTGTTTACTTTTGGATTTTTGCCAGTCTAACAGGTGAAGCCCTGGAGATTCTTATTAGTGATTT ?@BDDABCCBFDHA9F@E@FHDEBEHI@DEBDAHIII1:D@GB@FHIB?@??DGDHGCGB==C>;F@@C=C;??E=CBB?;ACACA35@@ RG:Z:A

I think there is an extra column:
SRR8418436.2 2
---------------^

How do I omit this column?

Michael L Heuer
@heuermh
Thanks, @hxdhan. Replied on issue bigdatagenomics/adam#2321
calvertj
@calvertj
Hello, if I have a SAM header and a Spark DataFrame or RDD of SAM data (rows of strings, arrays of strings, or columns of values, whatever would be best), what would be the best way to get it into an RDD[Alignment] for use with ADAM?
thanks!
Michael L Heuer
@heuermh
Hello @calvertj! There are lots of different answers to this question :wink:
Best would be if your Spark DataFrame looked like that of Alignments; then you could use loadAlignments(DataFrame, ...)
Where are you starting from?
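A minimal sketch of that route, assuming the rows can be mapped onto the bdg-formats Alignment fields. The exact entry point (a loadAlignments overload taking a DataFrame, or wrapping a Dataset in AlignmentDataset) varies by ADAM version, so treat this as a sketch rather than the canonical API:

import org.apache.spark.sql.DataFrame
import org.bdgenomics.adam.ds.ADAMContext._
import org.bdgenomics.adam.ds.read.AlignmentDataset
import org.bdgenomics.adam.sql.{ Alignment => AlignmentProduct }

// assumption: df has columns named after the bdg-formats Alignment schema,
// e.g. readName, sequence, qualityScores, referenceName, start, end, cigar
val df: DataFrame = ???

// map the rows onto the Alignment case class, then wrap as an AlignmentDataset
import spark.implicits._
val alignments = AlignmentDataset(df.as[AlignmentProduct])

// alignments.rdd then yields the RDD[Alignment] asked about above
val rddOfAlignments = alignments.rdd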
calvertj
@calvertj
Sorry for being off topic, but is there a Gitter for bdg cannoli, @heuermh? Thanks.
Michael L Heuer
@heuermh
Here is fine for Cannoli questions
Larry N. Singh
@larryns
Hi everyone, I'd like to announce MitoScape (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009594), for which I've used ADAM extensively.
Thanks for all your help.
Michael L Heuer
@heuermh
Very cool, thank you! Let me know if you tweet something about it and I'll retweet from the bigdatagenomics account
Larry N. Singh
@larryns
I'm not on twitter, but my co-author tweeted it: https://twitter.com/DrIsabelLopez/status/1458948183422242839?s=20
Michael L Heuer
@heuermh
Perfect! There is also the https://github.com/bigdatagenomics/awesome-adam repo, if you wanted to add an entry there
Larry N. Singh
@larryns
@heuermh Would love to add an entry. I don't think I can edit the file though?
Michael L Heuer
@heuermh
Ah, just create a GitHub pull request on that repo, or post text here and I can add it
Larry N. Singh
@larryns
Hmm, I tried to create a new branch, but I don't think I can since I'm not a collaborator.
Text should probably be:
MitoScape - MitoScape is a big-data, machine-learning workflow for obtaining mtDNA sequence from NGS data.
Michael L Heuer
@heuermh
Sounds good, thanks!
Niru Chennagiri
@cvniru_gitlab
I am trying to extract the FASTA sequences for millions of intervals from a BED file, functionality similar to getfasta in bedtools. Is there a way to do that using ADAM?
Michael L Heuer
@heuermh
Hello @cvniru_gitlab! What does the reference (FASTA format) look like?
Niru Chennagiri
@cvniru_gitlab
I have the standard hg19 FASTA file. https://www.bioinformatics.nl/tools/crab_fasta.html
Michael L Heuer
@heuermh
Thanks! I asked because we have to broadcast the reference across all the compute nodes. If it were a large number of very small FASTA files, there might be a better way to do it.
import org.bdgenomics.adam.ds.ADAMContext._
import org.bdgenomics.adam.models.ReferenceRegion

// load the reference with a 10 kbp fragment size, then load the BED
// features against the reference's sequence dictionary
val ref = sc.loadReferenceFile("reference.fa", 10000L)
val features = sc.loadFeatures("features.bed", Some(ref.references))

// broadcast the reference to all executors and extract the sequence
// underlying each feature
val broadcastRef = sc.broadcast(ref)
val sequences = features.rdd.map(f => (f, broadcastRef.value.extract(ReferenceRegion.unstranded(f))))
Or, if you have stranded features:
val sequences = features.rdd.map(f => (f, broadcastRef.value.extract(ReferenceRegion.stranded(f))))
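To get getfasta-like output from those pairs, a hedged follow-up; the FASTA-style formatting and the fallback naming below are illustrative assumptions, not a canonical ADAM API:

// BED feature names may be null, so fall back to coordinates for the header
val fasta = sequences.map { case (f, seq) =>
  val name = Option(f.getName).getOrElse(s"${f.getReferenceName}:${f.getStart}-${f.getEnd}")
  s">$name\n$seq"
}
fasta.saveAsTextFile("sequences.fasta")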
Niru Chennagiri
@cvniru_gitlab
Our pipeline currently runs tools such as Picard HsMetrics, which can't be distributed because it collects stats on the whole BAM file. If I save partitioned BAM files from my alignment step, how do you suggest I feed that into HsMetrics?
Michael L Heuer
@heuermh
If you want a single BAM file, you can merge the partitions when saving with asSingleFile = true, e.g.
alignments.saveAsSam("merged.bam", asSingleFile = true)
If you would rather collect stats in a parallel/distributed fashion, many metrics would be simple SQL queries, e.g. in adam-shell
val alignments = ...
val df = alignments.toDF
df.createOrReplaceTempView("alignments")

spark.sql("select count(*) from alignments").show()
spark.sql("select count(*) from alignments where mappingQuality > 40").show()
spark.sql("select count(*) from alignments where readMapped = true").show()
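Grouped queries work the same way; for example, a hedged sketch of per-reference summaries over the same alignments view (field names as in the bdg-formats Alignment schema):

spark.sql("""
  select referenceName, count(*) as reads, avg(mappingQuality) as meanMappingQuality
  from alignments
  where readMapped = true
  group by referenceName
  order by referenceName
""").show()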
nbraid
@nbraid
Hello! My background is in software development, so please forgive my lack of genomics knowledge. I see in https://adam.readthedocs.io/en/latest/benchmarks/storage/ that CRAM is more compact than ADAM/Parquet because it can use references to drop uninteresting chunks of data. Would it be possible to do the same with Parquet? i.e., reconstruct the original sequence data by doing a SELECT pos, ref, alt... FROM alignments OUTER JOIN reference ON alignments.reference = reference.id. Would we expect those ratios to hold even for terabyte- or petabyte-scale datasets? My understanding of how Parquet works is that it's good at exactly that: compressing data points that are repeated.
Michael L Heuer
@heuermh

My understanding of how Parquet works is that it's good at exactly that: compressing data points that are repeated.

Yes, exactly. You can run parquet-tools meta on any dataset written out in Parquet format by ADAM, and it will provide column-wise compression statistics.

$ parquet-tools meta NA12878.alignedHg38.duplicateMarked.baseRealigned.alignments.adam
...
row group 1:               RC:61614 TS:55357462 OFFSET:4
--------------------------------------------------------------------------------
referenceName:              BINARY GZIP DO:4 FPO:942 SZ:1729/6586/3.81 VC:61614 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: chr1, max: chrY, num_nulls: 442]
start:                      INT64 GZIP DO:0 FPO:1733 SZ:184263/489524/2.66 VC:61614 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 0, max: 248940527, num_nulls: 442]
originalStart:              INT64 GZIP DO:0 FPO:185996 SZ:210/130/0.62 VC:61614 ENC:PLAIN,BIT_PACKED,RLE ST:[num_nulls: 61614, min/max not defined]
end:                        INT64 GZIP DO:0 FPO:186206 SZ:185273/489523/2.64 VC:61614 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 57, max: 248940776, num_nulls: 442]
mappingQuality:             INT32 GZIP DO:371479 FPO:371610 SZ:9210/17304/1.88 VC:61614 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: 0, max: 60, num_nulls: 442]
readName:                   BINARY GZIP DO:380689 FPO:496636 SZ:277806/1647133/5.93 VC:61614 ENC:PLAIN,PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: H06HDADXX130110:1:1101:10000:100394, max: H06JUADXX130110:1:1101:10089:47691, num_nulls: 0]
...
Michael L Heuer
@heuermh
You could drop the sequence column and reconstruct it from the reference when necessary, but since ADAM can use column projections and push down predicates when reading from Parquet, there is little harm in keeping those data on disk: they don't need to be read from disk into RAM unless necessary.
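A hedged sketch of what a column projection looks like when loading; the field enumeration and parameter names have shifted across ADAM versions (e.g. AlignmentField vs. AlignmentRecordField), so check your version's ADAMContext:

import org.bdgenomics.adam.ds.ADAMContext._
import org.bdgenomics.adam.projections.{ AlignmentField, Projection }

// only materialize the columns we need; sequence, qualityScores, etc.
// stay on disk and are never read into RAM
val projection = Projection(
  AlignmentField.referenceName,
  AlignmentField.start,
  AlignmentField.end,
  AlignmentField.mappingQuality)

val projected = sc.loadParquetAlignments(
  "alignments.adam",
  optProjection = Some(projection))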
nbraid
@nbraid

How should I think about ADAM and Disq (or Hadoop-BAM or Spark-BAM)? Is it correct to say that these libraries are responsible for reading FASTQ/BAM/CRAM/VCF data into Spark RDDs, but not for converting into a Spark Dataset or writing to Parquet (and thus do not currently use or overlap with the bdg-formats schemas)?

From the benchmarks and comments earlier in this channel, it seems like Disq is what I want to use to deal with these genomics files, but I want to confirm that I'm thinking of the responsibilities of the two correctly and that Disq is indeed the right horse to back adoption- and roadmap-wise.

Michael L Heuer
@heuermh
ADAM's primary on-disk storage format is Parquet via the bdg-formats schemas. For some use cases, ADAM also supports a range-partitioned Parquet format.
ADAM, via its GenomicDataset abstraction, can realize in RAM either Spark RDDs of Java objects defined by the bdg-formats schemas or Spark Datasets of Scala case classes also defined by the bdg-formats schemas.
Michael L Heuer
@heuermh
When ADAM reads or writes BAM/CRAM/SAM or VCF, it uses Hadoop-BAM for the underlying I/O and converts htsjdk Java objects (e.g. SAMRecord) to bdg-formats objects (e.g. Alignment) and back. Disq also works with htsjdk Java objects, so the conversion step is still necessary. ADAM reads or writes FASTA, FASTQ, BED6/12, GFF3, GTF/GFF2, NarrowPeak, and IntervalList formats directly. For GenBank, there is the biojava-adam library.
Michael L Heuer
@heuermh
Disq on its own only makes sense if you want to use htsjdk APIs, e.g. JavaRDD<SAMRecord>. If you want to use RDD[Alignment], Dataset[Alignment], or DataFrame from Scala, Java, Python, or R, and/or read from or write to Parquet in bdg-formats schemas on disk, then you'll need ADAM.
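To make the contrast concrete, a minimal sketch of the two entry points (the Disq call follows its README; jsc is assumed to be a JavaSparkContext, sc a SparkContext):

// Disq: reads land as htsjdk objects
import org.disq_bio.disq.HtsjdkReadsRddStorage
val htsjdkReads = HtsjdkReadsRddStorage.makeDefault(jsc)
  .read("sample.bam")
  .getReads() // JavaRDD[htsjdk.samtools.SAMRecord]

// ADAM: reads land as bdg-formats objects, with RDD/Dataset/DataFrame views
import org.bdgenomics.adam.ds.ADAMContext._
val alignments = sc.loadAlignments("sample.bam") // AlignmentDataset
val rdd = alignments.rdd  // RDD[org.bdgenomics.formats.avro.Alignment]
val df = alignments.toDF  // DataFrame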
calvertj
@calvertj
Hello @heuermh, thanks for your help. We are now converting our dataset objects into various ADAM datasets. We have an ADAM AlignmentDataset, and the scientists want to run ADAM's BQSR on it. How would I do that programmatically? Thanks again.
Michael L Heuer
@heuermh
It is a single method call: AlignmentDataset.recalibrateBaseQualities.
For how to set this up, you can crib from the command-line implementation.
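A hedged sketch of that setup, cribbed from the command-line code path; the SnpTable constructor and the exact recalibrateBaseQualities parameters may differ slightly between ADAM versions:

import org.bdgenomics.adam.ds.ADAMContext._
import org.bdgenomics.adam.models.SnpTable

// known variant sites keep real SNPs from being counted as sequencing errors
val knownSnps = SnpTable(sc.loadVariants("known_snps.vcf"))

// broadcast the table to the executors and recalibrate the AlignmentDataset
val recalibrated = alignments.recalibrateBaseQualities(sc.broadcast(knownSnps))

recalibrated.saveAsSam("recalibrated.bam", asSingleFile = true)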
Chris Nobles
@cnobles
Hi @heuermh, quick question: I was recalibrating some reads using the above process and noticed that the recalibration was not applied to the MAPQ = 0 reads. Am I doing something wrong, or is there some rationale behind not applying the recalibration to those reads?
Michael L Heuer
@heuermh
Hello @cnobles! I am not sure any such rationale has been documented if it is indeed the case. The BQSR algorithm docs are at
https://adam.readthedocs.io/en/latest/algorithms/bqsr
Chris Nobles
@cnobles
Gotcha, thanks @heuermh. Another quick question: is there a way to pull the covariance matrix from the recalibration?
Michael L Heuer
@heuermh
To write it out?
Michael L Heuer
@heuermh

Method doc:

We consider a read valid for use in generating the recalibration table if it is
a "proper" read. That is to say, it is the canonical read (mapped as a primary
alignment with non-zero mapQ and not marked as a duplicate), it did not fail
vendor checks, and the quality and CIGAR are defined.

https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/ds/read/recalibration/BaseQualityRecalibration.scala#L128

Chris Nobles
@cnobles
Great, thanks for the pointers on the observation table and method; I'll see what I can do. As a point on the method doc above: it makes total sense that the recalibration table, which carries the covariates, would only be generated from "proper" reads. What I was referring to is that the recalibration is only applied to the "proper" reads and not to any of the other reads that were marked as not proper. For instance, I'm doing a number of comparisons right now, and I can build my recalibration table using GATK BQSR with only proper reads but then apply the model (with dinucleotide and cycle covariates) to all the reads that were sequenced, or to a single read group. I'm trying to figure out how to do this with the ADAM implementation at the moment.
a-li
@a-li
Hello, is variant normalization available in ADAM?
Michael L Heuer
@heuermh
Hello, @a-li! There is some discussion on implementing variant normalization here
bigdatagenomics/adam#1710
As mentioned there, you can run variant normalization on ADAM data with Cannoli and either bcftools or vt.
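For the Cannoli route, a rough sketch of the interactive usage pattern; the function and argument class names below (normalizeWithBcftools, BcftoolsNormArgs, and its fields) are assumptions to be checked against your Cannoli version, and bcftools must be available on the path or via Docker:

import org.bdgenomics.adam.ds.ADAMContext._
import org.bdgenomics.cannoli.Cannoli._

// load VCF, pipe each partition through bcftools norm, and collect the result
val variantContexts = sc.loadVcf("sample.vcf")
val args = new BcftoolsNormArgs() // hypothetical args class; check Cannoli docs
args.referencePath = "reference.fa" // hypothetical field name
val normalized = variantContexts.normalizeWithBcftools(args)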