Michael L Heuer
@heuermh
Well, Spark 3 isn't out yet ;)
Michael L Heuer
@heuermh
The big problem is the utils-metrics module, which is deeply dependent on internal Spark classes. In the pull request above, I started down the path of removing the ADAM dependency on utils-metrics, but that is a big effort, and losing metrics would be unfortunate
Michael L Heuer
@heuermh
As an update, the utils-metrics module was deprecated in release 0.2.16 and will be removed in version 0.3.0, which will also cross-build Spark 2 and Spark 3.
ADAM version 0.31.0 will be released soon, with deprecations from utils-metrics 0.2.16 and others. Then ADAM version 0.32.0 will depend on utils version 0.3.0 and will also cross-build Spark 2 and Spark 3.
Michael L Heuer
@heuermh
If there are any other issues/pull requests folks would like to see triaged to the 0.31.0 release, please let me know
Karen Feng
@karenfeng
Awesome, thanks Michael. When would you ballpark the release for 0.32.0?
Michael L Heuer
@heuermh
Not exactly sure, bdg-utils 0.3.0 and ADAM 0.32.0 should probably wait for Spark 3.0 proper instead of the preview releases
And I don't have access to any Spark 3 cluster environments for testing, e.g. AWS EMR is still at Spark 2.4.4
jamesthegiantpeach
@jamesthegiantpeach
How do you shard fastq bgz files without having an index? That is, how do you seek into a bgz fastq file and know that a block doesn't have half of a fastq record in it? Couldn't find the code in the repo
Michael L Heuer
@heuermh
Block-gzipped (bgz/bgzf) FASTQ files are supported by the code in this package https://github.com/bigdatagenomics/adam/tree/master/adam-core/src/main/java/org/bdgenomics/adam/io
Most of it came in on this commit bigdatagenomics/adam@985e5d8
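For readers wondering how a reader can land mid-record and still recover, here is a minimal sketch of one common heuristic for four-line FASTQ. It is an illustration only, not the code in the package linked above, and the names `FastqBoundary` and `firstRecordStart` are made up for this example. It assumes the block has already been decompressed into lines, and only covers re-synchronizing on a record boundary: a header line starts with `@`, but so can a quality line, so the check also requires a `+` separator two lines later.

```scala
object FastqBoundary {
  /**
   * Returns the index of the first line that plausibly starts a FASTQ record,
   * or -1 if none is found. A line is accepted only if it begins with '@' and
   * the line two positions later begins with '+', because a quality line may
   * itself begin with '@'.
   */
  def firstRecordStart(lines: IndexedSeq[String]): Int =
    lines.indices.indexWhere { i =>
      lines(i).startsWith("@") &&
        i + 2 < lines.length &&
        lines(i + 2).startsWith("+")
    }
}
```

A reader that starts in the middle of a record would skip everything before the returned index and leave those lines to the reader of the previous split.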
Jashan Goyal
@jashangoyal09
Hi folks, how can I get files in Avro format from an S3 bucket?
Michael L Heuer
@heuermh
Hello @jashangoyal09, it depends on your use case. Would you be running Spark from AWS? on EMR?
Larry N. Singh
@larryns
Hello, I'm trying to join an AlignmentDataset to a DataFrame. I've tried converting the AlignmentDataset to a DataFrame and joining, but I don't see an obvious way of converting the DataFrame back to an AlignmentDataset so I can save it as a BAM. Alternatively, if I can just join the AlignmentDataset to the DataFrame and get an AlignmentDataset, that would work too.
I looked at the transformDataset and joining with that, but the conversion isn't quite right and I'm not sure if this is the best way. Any advice?
Michael L Heuer
@heuermh

In pseudocode the transform would look like

import org.bdgenomics.adam.sql.Alignment

val other = ...// some df
val alignments = sc.loadAlignments("sample.bam")
val joined = alignments.transformDataset(
  ds => ds.toDF().join(other).as[Alignment]
)

Another option is to use the AlignmentDataset constructor

import org.bdgenomics.adam.sql.Alignment

val other = ...// some df
val alignments = sc.loadAlignments("sample.bam")
val joinedDataframe = alignments.toDF().join(other)
val joinedDataset = joinedDataframe.as[Alignment]
val joinedAlignments = AlignmentDataset(
  joinedDataset,
  alignments.sequences,
  alignments.readGroups,
  alignments.processingSteps  
)
Larry N. Singh
@larryns
@heuermh, thanks for the pseudocode. What you wrote is basically what I ended up writing, so it's good to have the confirmation that I did it the right way.
Thanks again!
Michael L Heuer
@heuermh
Good to hear!
Anton Kulaga
@antonkulaga
Guys, has anybody tested latest ADAM with hadoop 3?
Michael L Heuer
@heuermh
@antonkulaga The build seems to work ok, and I haven't run into any trouble with a local spark yet. I don't have any cluster environments with it available however. Will track progress at bigdatagenomics/adam#2267
Larry N. Singh
@larryns
I was wondering if I could get some help with paired-end bams. I'm using Adam to load a bam file and only keep those reads where at least one of the pairs overlaps a set of chromosomal locations, e.g. a bed file loaded as a dataframe.
Larry N. Singh
@larryns

My pseudocode looks like:

import org.apache.spark.sql.functions.{ count, sum }

val readsDS = ac.loadAlignments(inputBam) // Load the bam file with paired end reads
val readsKeep = readsDS.toDF()
  .select("readName", "PairIndex")
  .groupBy("readName") // aggregate per read name, not over the whole table
  .agg(count("readName").alias("countReads"), sum("PairIndex").alias("sumPairs")) // Assuming PairIndex = 1 if 1st, and 2 if 2nd in pair
  .where("countReads = 2 AND sumPairs = 3") // This will give a dataframe of read names to keep

// Do a "self-join" with the read names for the reads to keep
val filteredReads: AlignmentDataset = readsDS.transformDataset(
  (ds: Dataset[AlignmentProduct]) =>
    ds.join(readsKeep, Seq("readName"), "left_semi") // keep rows whose readName is in readsKeep
      .as[AlignmentProduct]
)
// .................
Is there a better way to do what I'm doing? The paired end reads are giving me a headache.

P.S. I hate bam files.
P.P.S Thanks for any advice or suggestions. :)

Oh I forgot to add a line in readsKeep to actually join with the bed file, but assume that readsDS has been already joined with the bed file, so:
val readsOrig = ac.loadAlignments(inputBam)
val readsBed = loadBedFileIntoDF(inputBed)
val readsDS = readsOrig.join(readsBed /* , criteria for genomic co-ordinates */)
I hope that makes sense, and once again thanks for any help.
Michael L Heuer
@heuermh
Hello @larryns! ADAM takes care of the paired end read stuff by grouping Alignments into Fragments
Michael L Heuer
@heuermh
val fragments = sc.loadFragments("sample.bam")
val features = sc.loadFeatures("features.bed")
val overlap = features.broadcastRegionJoin(fragments)
For details on other region join types, see the doc in FragmentDataset and https://adam.readthedocs.io/en/latest/api/joins/
Larry N. Singh
@larryns

hi Michael, thank you again for your help! I'd looked at Fragments, but I wanted to also be able to filter reads based on read flags, e.g. getProperPair and getPrimaryAlignment and I couldn't figure out how to access the read info to filter the Fragments.

Any suggestions? Thanks so much again!
-Larry.

Michael L Heuer
@heuermh
Fragment includes the list of grouped alignments, so you can filter on nested Alignment attributes, or alternatively read the BAM file in as alignments, filter first, and then call alignments.toFragments()
There are some predefined filter methods that operate on the dataset-as-RDD or dataset-as-Dataset you may crib from
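The two routes described above can be sketched with plain Scala collections. This is only an illustration under stated assumptions: `Read` and `Frag` are hypothetical, simplified stand-ins for ADAM's Alignment and Fragment records, with just the fields needed here, and nothing below calls the ADAM API.

```scala
object FragmentFilter {
  // Hypothetical, simplified stand-ins for ADAM's Alignment and Fragment records.
  final case class Read(readName: String, properPair: Boolean, primaryAlignment: Boolean)
  final case class Frag(name: String, alignments: Seq[Read])

  // Flag predicate shared by both routes.
  private def passes(r: Read): Boolean = r.properPair && r.primaryAlignment

  /** Route 1: filter reads on flags first, then group the survivors by read name. */
  def filterThenGroup(reads: Seq[Read]): Seq[Frag] =
    reads
      .filter(passes)
      .groupBy(_.readName)
      .map { case (name, rs) => Frag(name, rs) }
      .toSeq
      .sortBy(_.name)

  /** Route 2: keep whole fragments whose nested reads all pass the flag check. */
  def filterFragments(frags: Seq[Frag]): Seq[Frag] =
    frags.filter(_.alignments.forall(passes))
}
```

The two routes differ when only some reads of a fragment pass: route 1 keeps the passing reads as a smaller fragment, while route 2 drops the fragment entirely.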
Larry N. Singh
@larryns
Thanks Michael, I'll have a look. I've been trying to read through the code on GitHub, but am sometimes unsure whether I'm doing things the right way. Really appreciate your help.
Larry N. Singh
@larryns

Hello again... I've been analyzing a BAM file with ADAM in which some mapping qualities are unavailable (i.e. mapping quality = 255). I noticed that when it is read into an AlignmentDataset, the mapping qualities of 255 are changed to null, which is fine and good. The problem is that when I write the values back to file with saveAsSam, the nulls are converted to 0, not 255. Any idea what I'm doing wrong, or is there a way to get 255s instead of 0s?

Thanks again for all your help and for fielding my (many) questions. :)

Larry N. Singh
@larryns

P.S. My workaround right now is:

alignmentDS.transformDataFrame(_.na.fill(255, Seq("mappingQuality"))).saveAsSam(...)

Thanks!

Michael L Heuer
@heuermh
Hello @larryns, sorry for the delay in responding! That is a bug in our conversion process, I'll file an issue and get it fixed for the next ADAM release
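The round trip Larry expects can be written down as a small sketch. This is not ADAM's converter code, and `MapqSentinel`, `fromSam` and `toSam` are hypothetical names. It assumes only what the SAM format defines: a MAPQ of 255 means the mapping quality is not available.

```scala
object MapqSentinel {
  // SAM uses 255 to mean "mapping quality not available".
  val Unavailable = 255

  /** Reading: turn the 255 sentinel into a missing value. */
  def fromSam(mapq: Int): Option[Int] =
    if (mapq == Unavailable) None else Some(mapq)

  /** Writing: turn a missing value back into 255 rather than 0. */
  def toSam(mapq: Option[Int]): Int =
    mapq.getOrElse(Unavailable)
}
```

With this pairing, `toSam(fromSam(q))` returns `q` for every valid MAPQ, including 255, which is the property the reported behaviour breaks.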
Larry N. Singh
@larryns
oh ok, great, thank you @heuermh! I'm glad to help
Kaustav
@kaustv-datta
Hello Team ADAM :) I'm evaluating ADAM for a new project which deals with RNA-seq data.
I wish to perform distributed deep learning on a spark cluster.
Do you have any experience with the Horovod deep learning framework?
Do you foresee any challenges in integrating Horovod with an ADAM+Spark cluster?
Michael L Heuer
@heuermh
Hello @kaustv-datta! I do not have experience with Horovod, but it appears you can integrate in at least two different ways: 1) load RNA-seq data into Spark DataFrames via ADAM and run Horovod Spark Estimator for training on those data frames; or 2) transform RNA-seq data and write out to disk as Parquet format via ADAM and run training directly on the Parquet dataset
Please let us know how you get on, it sounds like a very interesting use case!
Kaustav
@kaustv-datta
Thanks @heuermh ! I'll be sure to let you know if I do combine Horovod+ADAM
Michael L Heuer
@heuermh
:thumbsup:
Larry N. Singh
@larryns

hi, I've been searching through the documentation but couldn't find an answer so I thought I'd ask here. I apologize in advance if this question has been asked before.

I'm using broadcastRegionJoinAgainst(gffObject) to join reads to annotations in a GFF3 file. Is strand information used for determining overlap? If yes, is there a way to disable strand matching? If no, is there a way to enforce strand matching?

Thanks for your help!
Larry.

Larry N. Singh
@larryns
Any thoughts/help on my question? I've been stuck on this problem for a while. Thanks again!
Michael L Heuer
@heuermh
Hello @larryns, sorry for the delay. There was a pull request to consider strand in region joins, but it was dropped due to bit rot (bigdatagenomics/adam#1555). It should work to region join and then filter by strand afterwards.
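The "join, then filter by strand" approach can be sketched on plain collections. This is an illustration only: `Region` is a hypothetical, simplified record, the join is a naive nested loop rather than ADAM's broadcast region join, and half-open coordinates are assumed.

```scala
object StrandedJoin {
  // Hypothetical, simplified record; strand is '+', '-', or '.' (unknown).
  final case class Region(contig: String, start: Long, end: Long, strand: Char)

  /** Half-open interval overlap on the same contig, ignoring strand. */
  def overlaps(a: Region, b: Region): Boolean =
    a.contig == b.contig && a.start < b.end && b.start < a.end

  /** Naive region join: every (feature, read) pair that overlaps. */
  def regionJoin(features: Seq[Region], reads: Seq[Region]): Seq[(Region, Region)] =
    for (f <- features; r <- reads if overlaps(f, r)) yield (f, r)

  /** Post-filter that keeps only pairs lying on the same strand. */
  def sameStrand(joined: Seq[(Region, Region)]): Seq[(Region, Region)] =
    joined.filter { case (f, r) => f.strand == r.strand }
}
```

Because each joined pair carries the feature alongside the read, the feature's strand is still available after the join, so overlapping genes on opposite strands can be separated by the post-filter.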
Larry N. Singh
@larryns
Hi @heuermh , thanks for the response. I guess I'd have to do a region join, then do another join on the gff dataframe, because in some cases I need to know what strand the gene is on, especially if there are overlapping genes at the same location but on differing strands. Thanks again.
By the way, I assume by "bit rot" you mean that bugs were introduced? If there's any way I can help with this matter, please let me know. Thanks again!
Michael L Heuer
@heuermh
Oh by bit rot I mean that too much time has passed, the pull request no longer applies cleanly, the author has moved on to other things, etc.