Michael L Heuer
@heuermh
The partitions in Hive-style region-partitioned Parquet are the folder-based partitions on disk, not necessarily how Spark partitions data in memory; the terminology overlaps, unfortunately.
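For illustration, the on-disk layout looks roughly like this (a sketch; the folder names are indicative, given an AlignmentDataset named alignments):

// write Hive-style folders keyed on genomic region, independent of
// Spark's in-memory partitioning
val alignments = sc.loadAlignments("sample.bam")
alignments.saveAsPartitionedParquet("sample.alignments.adam")
// resulting layout, roughly:
//   sample.alignments.adam/referenceName=chr1/positionBin=0/part-....parquet
//   sample.alignments.adam/referenceName=chr1/positionBin=1/part-....parquet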
jamesthegiantpeach
@jamesthegiantpeach

I sometimes get this error when running bwa and I'm not sure why:

Caused by: java.util.NoSuchElementException: key not found: sample
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:59)
    at scala.collection.MapLike$class.apply(MapLike.scala:141)
    at scala.collection.AbstractMap.apply(Map.scala:59)
    at org.bdgenomics.adam.models.ReadGroupDictionary.apply(ReadGroupDictionary.scala:133)
    at org.bdgenomics.adam.converters.AlignmentConverter$$anonfun$convert$1$$anonfun$apply$1.apply(AlignmentConverter.scala:259)
    at org.bdgenomics.adam.converters.AlignmentConverter$$anonfun$convert$1$$anonfun$apply$1.apply(AlignmentConverter.scala:257)

It comes from this: mid-alignment, the read group map doesn't contain the sample name that was assigned.

Just a local test:

// local paired-end FASTQ inputs and an indexed reference
val readFile1 = "/home/james/genomic_data/reads/test_1.fq"
val readFile2 = "/home/james/genomic_data/reads/test_2.fq"
val ref       = "/home/james/genomic_data/reference/human_g1k_v37.fasta"

// configure cannoli's bwa wrapper
val args = new BwaArgs()
args.indexPath = ref
args.sample = "sample"

// load read pairs as fragments, align, and save as a single BAM
val reads = sc.loadPairedFastqAsFragments(readFile1, readFile2)
val alignments = reads.alignWithBwa(args)
alignments.saveAsSam("/home/james/genomic_data/reads/test_1.bam", asSingleFile = true)
Michael L Heuer
@heuermh
Further up the stack trace, is this exception thrown during saveAsSam? I believe the alignWithBwa step is probably going fine; the problem may be that alignments does not have any read groups when trying to convert to BAM.
The cannoli bwa command-line class does two extra steps after piping out to bwa: creating read groups and references
https://github.com/bigdatagenomics/cannoli/blob/master/cli/src/main/scala/org/bdgenomics/cannoli/cli/Bwa.scala#L102
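In your local test, that fix would look something like this (a hedged sketch mirroring the CLI class, assuming ADAM's ReadGroup model constructor takes the sample ID followed by the read group ID):

import org.bdgenomics.adam.models.{ ReadGroup, ReadGroupDictionary }

// attach a read group whose sample matches args.sample ("sample"),
// so the BAM conversion can look it up (sketch, untested)
val withReadGroups = alignments.replaceReadGroups(
  ReadGroupDictionary(Seq(ReadGroup("sample", "sample"))))
withReadGroups.saveAsSam("/home/james/genomic_data/reads/test_1.bam", asSingleFile = true)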
jamesthegiantpeach
@jamesthegiantpeach

Interesting error:

Exception in thread "main" java.lang.IllegalArgumentException: Fastq 1 (s3a://bgi-concordance-studies/run190704NovaSeq_R1_A501.fastq.gz.corrected.fastq.err_barcode_removed.fixed.fastq.gz) has 1023782058 reads, fastq 2 (s3a://bgi-concordance-studies/run190704NovaSeq_R2_A501.fastq.gz.corrected.fastq.err_barcode_removed.fixed.fastq.gz) has 1023782057 reads
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadPairedFastq$1.apply(ADAMContext.scala:2536)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadPairedFastq$1.apply(ADAMContext.scala:2513)
    at scala.Option.fold(Option.scala:158)
    at org.apache.spark.rdd.Timer.time(Timer.scala:48)
    at org.bdgenomics.adam.rdd.ADAMContext.loadPairedFastq(ADAMContext.scala:2513)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadPairedFastqAsFragments$1.apply(ADAMContext.scala:3059)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadPairedFastqAsFragments$1.apply(ADAMContext.scala:3059)
    at scala.Option.fold(Option.scala:158)
    at org.apache.spark.rdd.Timer.time(Timer.scala:48)
    at org.bdgenomics.adam.rdd.ADAMContext.loadPairedFastqAsFragments(ADAMContext.scala:3057)
    at org.myome.bigbioinf.Main$.main(Main.scala:39)
    at org.myome.bigbioinf.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The true number of reads is 1023782052

jamesthegiantpeach
@jamesthegiantpeach

https://github.com/bigdatagenomics/adam/blob/dd99cad756ec486efdc896cb1cf7d5080a9c95dd/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala#L2533

Somehow this conditional is failing even though I have confirmed that the numbers of reads are the same
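For reference, one way to cross-check the counts through the same parsing path (a sketch; paths elided):

// count each FASTQ side independently via ADAM's own loader
val r1Count = sc.loadUnpairedFastq("<R1 path>").rdd.count()
val r2Count = sc.loadUnpairedFastq("<R2 path>").rdd.count()
println(s"R1: $r1Count, R2: $r2Count")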

jamesthegiantpeach
@jamesthegiantpeach

@A00741:47:HCM53DRXX:1:1101:1072:1000/1

Example barcode

Not sure if the barcode is causing a miscount or what.
Karen Feng
@karenfeng
When can we anticipate an ADAM release on Spark 3?
Michael L Heuer
@heuermh
Well, Spark 3 isn't out yet ;)
Michael L Heuer
@heuermh
The big problem is the utils-metrics module, which is deeply dependent on internal Spark classes. In the pull request above, I started down the path of removing the ADAM dependency on utils-metrics, but that is a big effort, and losing metrics would be unfortunate.
Michael L Heuer
@heuermh
As an update, the utils-metrics module was deprecated in bdg-utils release 0.2.16 and will be removed in bdg-utils 0.3.0, which will also cross-build for Spark 2 and Spark 3.
ADAM version 0.31.0 will be released soon, with deprecations from utils-metrics 0.2.16 and others. Then ADAM version 0.32.0 will depend on bdg-utils 0.3.0 and will also cross-build for Spark 2 and Spark 3.
Michael L Heuer
@heuermh
If there are any other issues/pull requests folks would like to see triaged to the 0.31.0 release, please let me know
Karen Feng
@karenfeng
Awesome, thanks Michael. When would you ballpark the release for 0.32.0?
Michael L Heuer
@heuermh
Not exactly sure; bdg-utils 0.3.0 and ADAM 0.32.0 should probably wait for Spark 3.0 proper instead of the preview releases.
And I don't have access to any Spark 3 cluster environments for testing, e.g. AWS EMR is still at Spark 2.4.4.
jamesthegiantpeach
@jamesthegiantpeach
How do you shard FASTQ bgz files without having an index? That is, how do you seek into a bgz FASTQ file and know that a block doesn't have half of a FASTQ record in it? I couldn't find the code in the repo.
Michael L Heuer
@heuermh
Block-gzipped (bgz/bgzf) FASTQ files are supported by the code in this package https://github.com/bigdatagenomics/adam/tree/master/adam-core/src/main/java/org/bdgenomics/adam/io
Most of it came in on this commit bigdatagenomics/adam@985e5d8
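The core trick is that every BGZF block starts with a fixed gzip header carrying a "BC" extra subfield, so a splitter can scan forward from an arbitrary file offset to a block boundary and then find the next FASTQ record start within the decompressed stream. A simplified sketch of the signature check (illustrative only; the real logic lives in the package above):

// true if buf(off) looks like the start of a BGZF block
def looksLikeBgzfBlockStart(buf: Array[Byte], off: Int): Boolean = {
  buf.length >= off + 14 &&
    (buf(off) & 0xff) == 0x1f &&     // gzip magic byte 1
    (buf(off + 1) & 0xff) == 0x8b && // gzip magic byte 2
    buf(off + 2) == 8 &&             // compression method: deflate
    (buf(off + 3) & 4) != 0 &&       // FLG.FEXTRA is set
    buf(off + 12) == 'B'.toByte &&   // "BC" extra subfield id marks BGZF
    buf(off + 13) == 'C'.toByte
}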
Jashan Goyal
@jashangoyal09
Hi folks, how can I get files in Avro format from an S3 bucket?
Michael L Heuer
@heuermh
Hello @jashangoyal09, it depends on your use case. Would you be running Spark from AWS? On EMR?
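In the common case, with the Hadoop s3a connector on the classpath and AWS credentials configured, ADAM reads its datasets straight from S3; a minimal sketch with a hypothetical bucket and path:

// load an ADAM-format (Parquet with Avro-defined schema) dataset from S3
val alignments = sc.loadAlignments("s3a://my-bucket/sample.alignments.adam")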
Larry N. Singh
@larryns
Hello, I'm trying to join an AlignmentDataset to a DataFrame. I've tried converting the AlignmentDataset to a DF and joining, but I don't see an obvious way of converting the DataFrame back to an AlignmentDataset so I can save it as a BAM. Alternatively, if I can just join the AlignmentDataset to the DataFrame and get an AlignmentDataset, that would work too.
I looked at transformDataset and joining with that, but the conversion isn't quite right, and I'm not sure if this is the best way. Any advice?
Michael L Heuer
@heuermh

In pseudocode the transform would look like:

import org.bdgenomics.adam.sql.Alignment

val other = ... // some DataFrame to join against
val alignments = sc.loadAlignments("sample.bam")
val joined = alignments.transformDataset(
  ds => ds.toDF().join(other).as[Alignment] // join condition elided
)

Another option is to use the AlignmentDataset constructor:

import org.bdgenomics.adam.rdd.read.AlignmentDataset // import path may vary by ADAM version
import org.bdgenomics.adam.sql.Alignment

val other = ... // some DataFrame to join against
val alignments = sc.loadAlignments("sample.bam")
val joinedDataframe = alignments.toDF().join(other) // join condition elided
val joinedDataset = joinedDataframe.as[Alignment]
// rebuild an AlignmentDataset, carrying over the original metadata
val joinedAlignments = AlignmentDataset(
  joinedDataset,
  alignments.sequences,
  alignments.readGroups,
  alignments.processingSteps
)
Larry N. Singh
@larryns
@heuermh, thanks for the pseudocode. What you wrote is basically what I ended up writing, so it's good to have the confirmation that I did it the right way.
Thanks again!
Michael L Heuer
@heuermh
Good to hear!
Anton Kulaga
@antonkulaga
Guys, has anybody tested the latest ADAM with Hadoop 3?
Michael L Heuer
@heuermh
@antonkulaga The build seems to work OK, and I haven't run into any trouble with a local Spark yet. I don't have any cluster environments with it available, however. Will track progress at bigdatagenomics/adam#2267
Larry N. Singh
@larryns
I was wondering if I could get some help with paired-end BAMs. I'm using ADAM to load a BAM file and only keep those reads where at least one read of the pair overlaps a set of chromosomal locations, e.g. a BED file loaded as a DataFrame.
Larry N. Singh
@larryns

My pseudocode looks like:

import org.apache.spark.sql.functions.{ count, sum }

// Load the BAM file with paired-end reads
val readsDS = ac.loadAlignments(inputBam)
// Assuming PairIndex = 1 if first in pair, and 2 if second in pair,
// this gives a dataframe of read names where both mates are present
val readsKeep = readsDS.toDF
  .select("readName", "PairIndex")
  .groupBy("readName")
  .agg(count("readName").alias("countReads"), sum("PairIndex").alias("sumPairs"))
  .where("countReads = 2 AND sumPairs = 3")

// Do a "self-join" with the read names for the reads to keep
val filteredReads: AlignmentDataset = readsDS.transformDataset(
  (ds: Dataset[AlignmentProduct]) =>
    ds.join(readsKeep, ds.col("readName") === readsKeep.col("readName"), "left_semi")
      .as[AlignmentProduct]
)

// .................
Is there a better way to do what I'm doing? The paired-end reads are giving me a headache.

P.S. I hate BAM files.
P.P.S. Thanks for any advice or suggestions. :)

Oh, I forgot to add a line in readsKeep to actually join with the BED file; assume that readsDS has already been joined with the BED file, so:
val readsOrig = ac.loadAlignments(inputBam)
val readsBed = loadBedFileIntoDF(inputBed)
val readsDS = readsOrig.join(readsBed, /* criteria for genomic coordinates */)
I hope that makes sense, and once again thanks for any help.
Michael L Heuer
@heuermh
Hello @larryns! ADAM takes care of the paired-end read stuff by grouping Alignments into Fragments.
Michael L Heuer
@heuermh
val fragments = sc.loadFragments("sample.bam") // mates grouped into Fragments
val features = sc.loadFeatures("features.bed")
// keep fragments overlapping any feature region
val overlap = features.broadcastRegionJoin(fragments)
For details on other region join types, see the doc in FragmentDataset and https://adam.readthedocs.io/en/latest/api/joins/
Larry N. Singh
@larryns

Hi Michael, thank you again for your help! I'd looked at Fragments, but I wanted to also be able to filter reads based on read flags, e.g. getProperPair and getPrimaryAlignment, and I couldn't figure out how to access the read info to filter the Fragments.

Any suggestions? Thanks so much again!
-Larry.

Michael L Heuer
@heuermh
Fragment includes the list of grouped alignments, so you can filter on nested Alignment attributes; alternatively, read the BAM file in as alignments, filter first, and then call alignments.toFragments().
There are some predefined filter methods that operate on the dataset-as-RDD or dataset-as-Dataset that you may crib from.
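The filter-first route might look like this (a hedged sketch, assuming the properPair and primaryAlignment fields on ADAM's Alignment schema):

// filter alignments on flag fields, then regroup mates into fragments
val alignments = sc.loadAlignments("sample.bam")
val filtered = alignments.transformDataset(ds =>
  ds.filter(a => a.properPair.getOrElse(false) &&
    a.primaryAlignment.getOrElse(false)))
val fragments = filtered.toFragments()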
Larry N. Singh
@larryns
thanks Michael, I'll have a look. I've been trying to read through the code on github, but am sometimes unsure of if I'm doing things the right way. Really appreciate your help.
Larry N. Singh
@larryns

Hello again... I'd been analyzing a BAM file with ADAM with mapping qualities that are unavailable (i.e. mapping quality = 255). I noticed that when read into an AlignmentDataset, the mapping qualities corresponding to 255 are changed to null, which is fine and good. The problem is that when I write the values back to file with saveAsSam, the nulls are converted to 0, not 255. Any idea what I'm doing wrong, or is there a way to get 255s instead of 0s?

Thanks again for all your help and for fielding my (many) questions. :)

Larry N. Singh
@larryns

P.S. My workaround right now is:

alignmentDS.transformDataFrame(_.na.fill(255, Seq("mappingQuality"))).saveAsSam(...)

Thanks!

Michael L Heuer
@heuermh
Hello @larryns, sorry for the delay in responding! That is a bug in our conversion process; I'll file an issue and get it fixed for the next ADAM release.
Larry N. Singh
@larryns
oh ok, great, thank you @heuermh! I'm glad to help
Kaustav
@kaustv-datta
Hello Team ADAM :) I'm evaluating ADAM for a new project which deals with RNA-seq data.
I wish to perform distributed deep learning on a spark cluster.
Do you have any experience with the Horovod deep learning framework?
Do you foresee any challenges in integrating Horovod with an ADAM+Spark cluster?
Michael L Heuer
@heuermh
Hello @kaustv-datta! I do not have experience with Horovod, but it appears you can integrate in at least two different ways: 1) load RNA-seq data into Spark DataFrames via ADAM and run the Horovod Spark Estimator for training on those data frames; or 2) transform RNA-seq data and write it out to disk in Parquet format via ADAM, then run training directly on the Parquet dataset. A sketch of the second option follows.
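A minimal sketch of option 2 (paths are hypothetical; ADAM writes Parquet with its Avro-defined schema):

// transform alignments with ADAM and persist as Parquet for training
val alignments = sc.loadAlignments("rnaseq-sample.bam")
alignments.saveAsParquet("rnaseq-sample.alignments.adam")
// a Parquet-aware data loader can then feed training directly from this directory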