Roman Valls Guimera
@brainstorm
Thanks Michael for the pointer and explanation. I do agree and thought about the inconvenience of putting those low level details (although just as an optional field)... the main use case would be https://samtools.github.io/hts-specs/htsget.html, where a particular genomic coordinate should return a JSON with matching reads and accompanying range of bytes from the original file.
I can see how Mango encompasses a lot of the fast indexing/interval tree ideas that htsget needs on the backend, so that's a big win/realization I did not have before... I'll explore Mango a bit more and see if there's a chance to expose those intervals as an htsget-like REST API.
JavaAndPython55
@JavaAndPython55
Regarding the problem that ADAM cannot be installed from China: the build fails with Failed to execute goal on project adam-core-spark2_2.11: Could not resolve dependencies for project org.bdgenomics.adam:adam-core-spark2_2.11:jar:0.30.0-SNAPSHOT: The following artifacts could not be resolved: org.apache.avro:avro-mapred:jar:hadoop2:1.8.2. It would help to have a download mirror hosted in China. If you can give me the files, I can provide a compiled download mirror for ADAM's Maven artifacts, for use by ADAM users in China and to promote the spread of ADAM there.
jamesthegiantpeach
@jamesthegiantpeach
Anyone have an example of using ADAM to bin reads and run HaplotypeCaller over the bins? I want to run some avocado benchmarks against HaplotypeCaller, but I'd rather not use their Spark implementation. Looking to just bin reads and run the single-threaded version on each of the bins.
Michael L Heuer
@heuermh
You can use adam to write out Hive-style region partitioned Parquet, and then load by region and write to BAM, but I imagine that might not be any more efficient than pulling from an indexed BAM
// load alignments and write out Hive-style region-partitioned Parquet
val alignments = sc.loadAlignments("sample.bam")
alignments.saveAsPartitionedParquet("sample.partitioned.adam")
// then load back only the partitions overlapping the given regions
val bin = sc.loadPartitionedParquetAlignments("sample.partitioned.adam", regions = Seq(...))
jamesthegiantpeach
@jamesthegiantpeach
Hmm, I think that almost works. I would need to figure out how to "slop" those partitions, similar to bedtools slop (https://bedtools.readthedocs.io/en/latest/content/tools/slop.html)
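For illustration, the slop operation referenced here just pads each interval by a fixed flank on both sides, clipped to the contig bounds. A minimal sketch of that computation in plain Scala (the Region class and contig lengths below are made up for illustration, not ADAM API):

```scala
// Minimal stand-in for a genomic interval; not an ADAM class.
case class Region(contig: String, start: Long, end: Long)

// Hypothetical contig lengths, used to clip the padded interval.
val contigLengths = Map("chr1" -> 248956422L, "chr2" -> 242193529L)

// Pad a region by `flank` bases on each side, bedtools-slop style,
// clamping the result to [0, contig length].
def slop(r: Region, flank: Long): Region = {
  val len = contigLengths(r.contig)
  Region(r.contig, math.max(0L, r.start - flank), math.min(len, r.end + flank))
}

println(slop(Region("chr1", 100L, 200L), 150L)) // Region(chr1,0,350): start clamps to 0
```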
Michael L Heuer
@heuermh
That is how our pipe APIs work; flankSize is the slop between partitions sent to external commands.
jamesthegiantpeach
@jamesthegiantpeach
OH! Ok I just need to familiarize myself with the API more. Thanks
Michael L Heuer
@heuermh
The partitions in Hive-style region-partitioned Parquet are the foldered partitions on disk, not necessarily how Spark might partition; overlapping terminology, unfortunately.
jamesthegiantpeach
@jamesthegiantpeach

I sometimes get this error when running bwa, and I'm not sure why:

Caused by: java.util.NoSuchElementException: key not found: sample
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:59)
    at scala.collection.MapLike$class.apply(MapLike.scala:141)
    at scala.collection.AbstractMap.apply(Map.scala:59)
    at org.bdgenomics.adam.models.ReadGroupDictionary.apply(ReadGroupDictionary.scala:133)
    at org.bdgenomics.adam.converters.AlignmentConverter$$anonfun$convert$1$$anonfun$apply$1.apply(AlignmentConverter.scala:259)
    at org.bdgenomics.adam.converters.AlignmentConverter$$anonfun$convert$1$$anonfun$apply$1.apply(AlignmentConverter.scala:257)

It comes from this: mid-alignment, the read group map doesn't contain the sample name assigned.
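For reference, the key not found failure in the stack trace above is just Scala's default Map.apply behavior on a missing key, which is what the read group dictionary lookup ultimately hits. A minimal reproduction in plain Scala (the map contents here are invented):

```scala
// A Map standing in for the read group name -> read group lookup.
val readGroups = Map("rg1" -> "sample1")

// Map.apply throws NoSuchElementException("key not found: ...") on a miss.
val thrown =
  try { readGroups("sample"); None }
  catch { case e: NoSuchElementException => Some(e.getMessage) }

println(thrown) // Some(key not found: sample)
```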

Just a local test.

        val readFile1 = "/home/james/genomic_data/reads/test_1.fq"
        val readFile2 = "/home/james/genomic_data/reads/test_2.fq"
        val ref       = "/home/james/genomic_data/reference/human_g1k_v37.fasta"

        val args = new BwaArgs()
        args.indexPath = ref
        args.sample = "sample"

        val reads = sc.loadPairedFastqAsFragments(readFile1, readFile2)
        val alignments = reads.alignWithBwa(args)
        alignments.saveAsSam("/home/james/genomic_data/reads/test_1.bam", asSingleFile = true)
Michael L Heuer
@heuermh
Further up the stack trace, is this exception thrown during saveAsSam? I believe the alignWithBwa step is probably going fine; the problem may be that alignments does not have any read groups when trying to convert to BAM.
The cannoli bwa command line class does two extra steps after piping out to bwa: creating read groups and references
https://github.com/bigdatagenomics/cannoli/blob/master/cli/src/main/scala/org/bdgenomics/cannoli/cli/Bwa.scala#L102
jamesthegiantpeach
@jamesthegiantpeach

Interesting error

Exception in thread "main" java.lang.IllegalArgumentException: Fastq 1 (s3a://bgi-concordance-studies/run190704NovaSeq_R1_A501.fastq.gz.corrected.fastq.err_barcode_removed.fixed.fastq.gz) has 1023782058 reads, fastq 2 (s3a://bgi-concordance-studies/run190704NovaSeq_R2_A501.fastq.gz.corrected.fastq.err_barcode_removed.fixed.fastq.gz) has 1023782057 reads
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadPairedFastq$1.apply(ADAMContext.scala:2536)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadPairedFastq$1.apply(ADAMContext.scala:2513)
    at scala.Option.fold(Option.scala:158)
    at org.apache.spark.rdd.Timer.time(Timer.scala:48)
    at org.bdgenomics.adam.rdd.ADAMContext.loadPairedFastq(ADAMContext.scala:2513)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadPairedFastqAsFragments$1.apply(ADAMContext.scala:3059)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadPairedFastqAsFragments$1.apply(ADAMContext.scala:3059)
    at scala.Option.fold(Option.scala:158)
    at org.apache.spark.rdd.Timer.time(Timer.scala:48)
    at org.bdgenomics.adam.rdd.ADAMContext.loadPairedFastqAsFragments(ADAMContext.scala:3057)
    at org.myome.bigbioinf.Main$.main(Main.scala:39)
    at org.myome.bigbioinf.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The true number of reads is 1023782052

jamesthegiantpeach
@jamesthegiantpeach

https://github.com/bigdatagenomics/adam/blob/dd99cad756ec486efdc896cb1cf7d5080a9c95dd/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala#L2533

Somehow this conditional is failing even though I have confirmed that the numbers of reads are the same

jamesthegiantpeach
@jamesthegiantpeach

@A00741:47:HCM53DRXX:1:1101:1072:1000/1

Example barcode

Not sure if the barcode is causing a miscount or what
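If the /1 and /2 suffixes are the suspect, one way to sanity-check the pairing outside of ADAM is to compare read names with the pair suffix stripped. A plain-Scala sketch (the helper names and example read names are made up):

```scala
// Strip a trailing /1 or /2 pair suffix from a FASTQ read name, if present.
def stripPairSuffix(name: String): String =
  if (name.endsWith("/1") || name.endsWith("/2")) name.dropRight(2) else name

// Walk two sequences of read names in lockstep and report the first
// pair whose suffix-stripped names disagree, if any.
def firstMismatch(r1: Seq[String], r2: Seq[String]): Option[(String, String)] =
  r1.zip(r2)
    .map { case (a, b) => (stripPairSuffix(a), stripPairSuffix(b)) }
    .find { case (a, b) => a != b }

println(stripPairSuffix("A00741:47:HCM53DRXX:1:1101:1072:1000/1"))
```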
Karen Feng
@karenfeng
When can we anticipate an ADAM release on Spark 3?
Michael L Heuer
@heuermh
Well, Spark 3 isn't out yet ;)
Michael L Heuer
@heuermh
The big problem is the utils-metrics module, which is deeply dependent on internal Spark classes. In the pull request above, I started down the path of removing the ADAM dependency on utils-metrics, but that is a big effort, and losing metrics would be unfortunate
Michael L Heuer
@heuermh
As an update, the utils-metrics module was deprecated in release 0.2.16 and will be removed in version 0.3.0, which will also cross-build Spark 2 and Spark 3.
ADAM version 0.31.0 will be released soon, with deprecations from utils-metrics 0.2.16 and others. Then ADAM version 0.32.0 will depend on utils version 0.3.0 and will also cross-build Spark 2 and Spark 3.
Michael L Heuer
@heuermh
If there are any other issues/pull requests folks would like to see triaged to the 0.31.0 release, please let me know
Karen Feng
@karenfeng
Awesome, thanks Michael. When would you ballpark the release for 0.32.0?
Michael L Heuer
@heuermh
Not exactly sure; bdg-utils 0.3.0 and ADAM 0.32.0 should probably wait for Spark 3.0 proper instead of the preview releases.
And I don't have access to any Spark 3 cluster environments for testing; e.g., AWS EMR is still at Spark 2.4.4.
jamesthegiantpeach
@jamesthegiantpeach
How do you shard fastq bgz files without having an index? That is, how do you seek into a bgz fastq file and know that a block doesn't have half of a fastq record in it? Couldn't find the code in the repo
Michael L Heuer
@heuermh
Block-gzipped (bgz/bgzf) FASTQ files are supported by the code in this package https://github.com/bigdatagenomics/adam/tree/master/adam-core/src/main/java/org/bdgenomics/adam/io
Most of it came in on this commit bigdatagenomics/adam@985e5d8
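The part that makes FASTQ splittable at all is detecting a record boundary after an arbitrary seek: a line starting with '@' is not enough on its own, since quality strings can also begin with '@'. A plain-Scala sketch of the kind of lookahead heuristic involved (this is an illustration of the idea, not the actual reader code from that package):

```scala
// Given the lines of a FASTQ chunk starting at an arbitrary line boundary,
// find the index of the first line that plausibly starts a complete record.
// We require both a leading '@' and the '+' separator two lines later,
// which disambiguates '@' at the start of a quality string.
def firstRecordStart(lines: IndexedSeq[String]): Option[Int] =
  (0 until lines.length).find { i =>
    lines(i).startsWith("@") &&
      i + 2 < lines.length &&
      lines(i + 2).startsWith("+")
  }

// A chunk that begins mid-record: sequence, '+', a quality line that
// happens to start with '@', then the true next record at index 3.
val chunk = IndexedSeq("GATTACA", "+", "@IIIIII", "@read2", "ACGTACG", "+", "IIIIIII")
println(firstRecordStart(chunk)) // Some(3)
```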
Jashan Goyal
@jashangoyal09
Hi folks, how can I get files in Avro format from an S3 bucket?
Michael L Heuer
@heuermh
Hello @jashangoyal09, it depends on your use case. Would you be running Spark from AWS? on EMR?
Larry N. Singh
@larryns
Hello, I'm trying to join an AlignmentDataset to a DataFrame. I've tried converting the AlignmentDataset to a DataFrame and joining, but I don't see an obvious way of converting the DataFrame back to an AlignmentDataset so I can save it as a BAM. Alternatively, if I could just join the AlignmentDataset to the DataFrame and get an AlignmentDataset back, that would work too.
I looked at transformDataset and joining with that, but the conversion isn't quite right and I'm not sure if this is the best way. Any advice?
Michael L Heuer
@heuermh

In pseudocode, the transform would look like

import org.bdgenomics.adam.sql.Alignment

val other = ... // some DataFrame
val alignments = sc.loadAlignments("sample.bam")
val joined = alignments.transformDataset(
  ds => ds.toDF().join(other).as[Alignment]
)

Another option is to use the AlignmentDataset constructor

import org.bdgenomics.adam.sql.Alignment

val other = ... // some DataFrame
val alignments = sc.loadAlignments("sample.bam")
val joinedDataframe = alignments.toDF().join(other)
val joinedDataset = joinedDataframe.as[Alignment]
val joinedAlignments = AlignmentDataset(
  joinedDataset,
  alignments.sequences,
  alignments.readGroups,
  alignments.processingSteps
)
Larry N. Singh
@larryns
@heuermh, thanks for the pseudocode. What you wrote is basically what I ended up writing, so it's good to have the confirmation that I did it the right way.
Thanks again!
Michael L Heuer
@heuermh
Good to hear!
Anton Kulaga
@antonkulaga
Guys, has anybody tested the latest ADAM with Hadoop 3?
Michael L Heuer
@heuermh
@antonkulaga The build seems to work ok, and I haven't run into any trouble with a local spark yet. I don't have any cluster environments with it available however. Will track progress at bigdatagenomics/adam#2267
Larry N. Singh
@larryns
I was wondering if I could get some help with paired-end bams. I'm using Adam to load a bam file and only keep those reads where at least one of the pairs overlaps a set of chromosomal locations, e.g. a bed file loaded as a dataframe.
Larry N. Singh
@larryns

My pseudocode looks like:

val readsDS = ac.loadAlignments(inputBam) // Load the bam file with paired-end reads

// Assuming PairIndex = 1 if first in pair, and 2 if second in pair:
// keep read names seen exactly twice, once per mate
val readsKeep = readsDS.toDF
  .select("readName", "PairIndex")
  .groupBy("readName")
  .agg(count("readName").alias("countReads"), sum("PairIndex").alias("sumPairs"))
  .where("countReads = 2 AND sumPairs = 3") // This gives a dataframe of read names to keep

// Do a "self-join" with the read names for the reads to keep
val filteredReads: AlignmentDataset = readsDS.transformDataset(
  (ds: Dataset[AlignmentProduct]) =>
    ds.join(readsKeep, ds.col("readName") === readsKeep.col("readName"), "left_semi")
      .as[AlignmentProduct]
)

// .................
Is there a better way to do what I'm doing? The paired end reads are giving me a headache.

P.S. I hate bam files.
P.P.S. Thanks for any advice or suggestions. :)

Oh, I forgot to add a line in readsKeep to actually join with the bed file; assume that readsDS has already been joined with the bed file, so:
val readsOrig = ac.loadAlignments(inputBam)
val readsBed = loadBedFileIntoDF(inputBed)
val readsDS = readsOrig.join(readsBed, ...) // criteria for genomic coordinates
I hope that makes sense, and once again thanks for any help.
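For what it's worth, the elided genomic-coordinate criterion is ordinarily half-open interval overlap on the same contig. A plain-Scala sketch of the predicate (the function and field names here are made up, not ADAM API):

```scala
// Half-open interval overlap: [aStart, aEnd) intersects [bStart, bEnd)
// only when both intervals are on the same contig and each starts
// before the other ends.
def overlaps(aContig: String, aStart: Long, aEnd: Long,
             bContig: String, bStart: Long, bEnd: Long): Boolean =
  aContig == bContig && aStart < bEnd && bStart < aEnd

println(overlaps("chr1", 100L, 200L, "chr1", 150L, 250L)) // true
```

In DataFrame terms this would become a join condition of the same shape on the contig/start/end columns, though ADAM's region join APIs handle this (plus partitioning) for you.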
Michael L Heuer
@heuermh
Hello @larryns! ADAM takes care of the paired-end read stuff by grouping Alignments into Fragments.
Michael L Heuer
@heuermh
// load paired-end reads grouped as fragments, and the BED file as features
val fragments = sc.loadFragments("sample.bam")
val features = sc.loadFeatures("features.bed")
// keep fragments overlapping features via a broadcast region join
val overlap = features.broadcastRegionJoin(fragments)
For details on other region join types, see the doc in FragmentDataset and https://adam.readthedocs.io/en/latest/api/joins/
Larry N. Singh
@larryns

Hi Michael, thank you again for your help! I'd looked at Fragments, but I wanted to also be able to filter reads based on read flags, e.g. getProperPair and getPrimaryAlignment, and I couldn't figure out how to access the read info to filter the Fragments.

Any suggestions? Thanks so much again!
-Larry.