Roman Valls Guimera
@brainstorm
Would it be possible to have BAM virtual offsets as (optional) transformAlignments output Parquet columns?
Context: rust-bio/rust-htslib#40 ... I would like to have those as part of the output parquet file. Currently I'm trying to do that outside of ADAM as an extra step: https://github.com/brainstorm/htsget-indexer ... but I was wondering if this is a feature/option you had in mind at some point?
Michael L Heuer
@heuermh
Where would the BAM virtual offsets for each record come from?
Roman Valls Guimera
@brainstorm
From either the BAM itself or the corresponding .BAI
According to the spec, those are calculated with some bit shifting... not sure if htsjdk has that in place, but it could/should come from there in the case of Java, I guess.
rust-htslib has .tell/.seek methods to retrieve those
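(For reference, the bit arithmetic the spec describes is simple. A minimal sketch, not ADAM or htsjdk code: a 64-bit virtual offset packs the compressed offset of the containing BGZF block into the upper 48 bits and the record's offset within the uncompressed block into the lower 16 bits.)

    // from the SAM/BGZF spec: voffset = coffset << 16 | uoffset
    def toVirtualOffset(coffset: Long, uoffset: Int): Long =
      (coffset << 16) | (uoffset & 0xffffL)

    // and the inverse decomposition
    def coffsetOf(voffset: Long): Long = voffset >>> 16
    def uoffsetOf(voffset: Long): Int = (voffset & 0xffffL).toInt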
Michael L Heuer
@heuermh
ADAM is reading from BAM files split/partitioned across executor nodes, so any accumulated byte counts are only going to be in the context of a single partition.
Michael L Heuer
@heuermh
I don't know that much about what htsjdk's support for .BAI files looks like; I'll have to dig
Roman Valls Guimera
@brainstorm
Yeah, essentially the voffsets are to be paired with genomic coordinates; see here for a more straightforward and low-level explanation: https://delagoya.github.io/rusty-bio/binary-file-parsing-in-bioinformatics
Roman Valls Guimera
@brainstorm
@heuermh Would it be acceptable to modify the bdg.avdl to include a couple of fields such as coffset/uoffset? I would like to avoid chucking those into the additional attributes: https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl#L446
Roman Valls Guimera
@brainstorm
(acceptable as a pull request, that is)
Michael L Heuer
@heuermh
I guess I'm on the fence, since those fields are specific to a flat file implementation that frankly we'd like to see go away. If an argument can be made that we need to keep those around in order to properly round-trip, and ADAM reads and writes those in the course of normal business, then I might be convinced.
Perhaps you might discuss the use case more? If alignment records are in Parquet on disk and you need quick access to region-based queries, we've pursued those use cases with IntervalRDD in Mango, and by writing Parquet partitioned by region, i.e. AlignmentDataset.saveAsPartitionedParquet(partitionSize = 100000)
Roman Valls Guimera
@brainstorm
Thanks Michael for the pointer and explanation. I agree, and I had thought about the awkwardness of putting those low-level details there (even just as an optional field)... the main use case would be https://samtools.github.io/hts-specs/htsget.html, where a query for a particular genomic coordinate should return JSON with the matching reads and the accompanying byte ranges from the original file.
I can see how Mango encompasses a lot of the fast indexing/interval tree ideas that htsget needs on the backend, so that's a big win/realization I did not have before... I'll explore Mango a bit more and see if there's a chance to expose those intervals as an htsget-like REST API.
JavaAndPython55
@JavaAndPython55
Regarding the problem that ADAM cannot be installed from China: builds fail with "Failed to execute goal on project adam-core-spark2_2.11: Could not resolve dependencies for project org.bdgenomics.adam:adam-core-spark2_2.11:jar:0.30.0-SNAPSHOT: The following artifacts could not be resolved: org.apache.avro:avro-mapred:jar:hadoop2:1.8.2". It would help to have a download mirror hosted in China. If you can give me the files, I can provide a compiled download mirror for ADAM's Maven artifacts in China, for use by ADAM users there and to help promote the spread of ADAM in China.
jamesthegiantpeach
@jamesthegiantpeach
Anyone have an example of using ADAM to bin reads and run HaplotypeCaller over the bins? Wanting to run some avocado benchmarks against HaplotypeCaller, but I'd rather not use their Spark implementation. Looking to just bin reads and run the single-threaded version on each of the bins
Michael L Heuer
@heuermh
You can use ADAM to write out Hive-style region-partitioned Parquet, and then load by region and write to BAM, but I imagine that might not be any more efficient than pulling from an indexed BAM
val alignments = sc.loadAlignments("sample.bam")
alignments.saveAsPartitionedParquet("sample.partitioned.adam")
val bin = sc.loadPartitionedParquetAlignments("sample.partitioned.adam", regions = Seq(...))
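(For illustration, a region query against that partitioned output might look like the following; the region is made up, and ReferenceRegion lives in org.bdgenomics.adam.models.)

    import org.bdgenomics.adam.models.ReferenceRegion

    // load only the partitions overlapping a (hypothetical) region of interest
    val binned = sc.loadPartitionedParquetAlignments(
      "sample.partitioned.adam",
      regions = Seq(ReferenceRegion("1", 100000L, 200000L)))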
jamesthegiantpeach
@jamesthegiantpeach
Hmm I think that almost works. I think I would need to figure out how to "slop" those partitions in a way similar to this (https://bedtools.readthedocs.io/en/latest/content/tools/slop.html)
Michael L Heuer
@heuermh
That is how our pipe APIs work; flankSize is the slop between partitions sent to external commands
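(A rough sketch of a pipe call with flanking; the type and formatter names follow the ADAM pipe documentation of this era and may differ by version, and the external command is a placeholder.)

    // pipe each partition through an external command; flankSize copies
    // overlapping bases (the "slop") between adjacent partitions
    import org.bdgenomics.adam.rdd.read.{ AlignmentDataset, AnySAMOutFormatter, SAMInFormatter }
    import org.bdgenomics.adam.sql.{ Alignment => AlignmentProduct }
    import org.bdgenomics.formats.avro.Alignment

    implicit val uFormatter = new AnySAMOutFormatter

    val piped = alignments.pipe[Alignment, AlignmentProduct, AlignmentDataset, SAMInFormatter](
      cmd = Seq("your_tool"),
      flankSize = 1000)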
jamesthegiantpeach
@jamesthegiantpeach
OH! Ok I just need to familiarize myself with the API more. Thanks
Michael L Heuer
@heuermh
The partitions in Hive-style region-partitioned Parquet are the foldered partitions on disk, not necessarily how Spark might partition; overlapping terminology, unfortunately
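(For intuition, the on-disk layout looks roughly like one folder per reference name and position bin; the names here are illustrative.)

    sample.partitioned.adam/
      referenceName=1/positionBin=0/part-00000.parquet
      referenceName=1/positionBin=1/part-00001.parquet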
jamesthegiantpeach
@jamesthegiantpeach

I sometimes get this error when running bwa and I'm not sure why:

Caused by: java.util.NoSuchElementException: key not found: sample
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:59)
    at scala.collection.MapLike$class.apply(MapLike.scala:141)
    at scala.collection.AbstractMap.apply(Map.scala:59)
    at org.bdgenomics.adam.models.ReadGroupDictionary.apply(ReadGroupDictionary.scala:133)
    at org.bdgenomics.adam.converters.AlignmentConverter$$anonfun$convert$1$$anonfun$apply$1.apply(AlignmentConverter.scala:259)
    at org.bdgenomics.adam.converters.AlignmentConverter$$anonfun$convert$1$$anonfun$apply$1.apply(AlignmentConverter.scala:257)

It comes from here: mid-alignment, the read group map doesn't contain the assigned sample name.

Just a local test.

        // local test: paired-end FASTQ inputs and a reference index
        val readFile1 = "/home/james/genomic_data/reads/test_1.fq"
        val readFile2 = "/home/james/genomic_data/reads/test_2.fq"
        val ref       = "/home/james/genomic_data/reference/human_g1k_v37.fasta"

        // configure bwa with the reference index path and a sample name
        val args = new BwaArgs()
        args.indexPath = ref
        args.sample = "sample"

        // load paired FASTQ as fragments, align, and save as a single BAM
        val reads = sc.loadPairedFastqAsFragments(readFile1, readFile2)
        val alignments = reads.alignWithBwa(args)
        alignments.saveAsSam("/home/james/genomic_data/reads/test_1.bam", asSingleFile = true)
Michael L Heuer
@heuermh
Further up the stack trace, is this exception thrown during saveAsSam? I believe the alignWithBwa step is probably going fine; the problem may be that alignments does not have any read groups when trying to convert to BAM.
The cannoli bwa command line class does two extra steps after piping out to bwa: creating read groups and references
https://github.com/bigdatagenomics/cannoli/blob/master/cli/src/main/scala/org/bdgenomics/cannoli/cli/Bwa.scala#L102
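(A hypothetical sketch of that first extra step, modeled on the cannoli class linked above; the ReadGroup constructor arguments are assumptions.)

    // attach a read group dictionary keyed by the sample name before
    // converting to SAM/BAM, so the read group lookup can resolve "sample"
    import org.bdgenomics.adam.models.{ ReadGroup, ReadGroupDictionary }

    val withReadGroups = alignments.replaceReadGroups(
      ReadGroupDictionary(Seq(ReadGroup("sample", "sample"))))
    withReadGroups.saveAsSam("/home/james/genomic_data/reads/test_1.bam", asSingleFile = true)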
jamesthegiantpeach
@jamesthegiantpeach

Interesting error

Exception in thread "main" java.lang.IllegalArgumentException: Fastq 1 (s3a://bgi-concordance-studies/run190704NovaSeq_R1_A501.fastq.gz.corrected.fastq.err_barcode_removed.fixed.fastq.gz) has 1023782058 reads, fastq 2 (s3a://bgi-concordance-studies/run190704NovaSeq_R2_A501.fastq.gz.corrected.fastq.err_barcode_removed.fixed.fastq.gz) has 1023782057 reads
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadPairedFastq$1.apply(ADAMContext.scala:2536)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadPairedFastq$1.apply(ADAMContext.scala:2513)
    at scala.Option.fold(Option.scala:158)
    at org.apache.spark.rdd.Timer.time(Timer.scala:48)
    at org.bdgenomics.adam.rdd.ADAMContext.loadPairedFastq(ADAMContext.scala:2513)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadPairedFastqAsFragments$1.apply(ADAMContext.scala:3059)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadPairedFastqAsFragments$1.apply(ADAMContext.scala:3059)
    at scala.Option.fold(Option.scala:158)
    at org.apache.spark.rdd.Timer.time(Timer.scala:48)
    at org.bdgenomics.adam.rdd.ADAMContext.loadPairedFastqAsFragments(ADAMContext.scala:3057)
    at org.myome.bigbioinf.Main$.main(Main.scala:39)
    at org.myome.bigbioinf.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The true number of reads is 1023782052

jamesthegiantpeach
@jamesthegiantpeach

https://github.com/bigdatagenomics/adam/blob/dd99cad756ec486efdc896cb1cf7d5080a9c95dd/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala#L2533

Somehow this conditional is failing even though I have confirmed that the number of reads is the same

jamesthegiantpeach
@jamesthegiantpeach

@A00741:47:HCM53DRXX:1:1101:1072:1000/1

Example barcode

not sure if the barcode is causing a miscount or what
Karen Feng
@karenfeng
When can we anticipate an ADAM release on Spark 3?
Michael L Heuer
@heuermh
Well, Spark 3 isn't out yet ;)
Michael L Heuer
@heuermh
The big problem is the utils-metrics module, which is deeply dependent on internal Spark classes. In the pull request above, I started down the path of removing the ADAM dependency on utils-metrics, but that is a big effort, and losing metrics would be unfortunate
Michael L Heuer
@heuermh
As an update, the utils-metrics module was deprecated in bdg-utils release 0.2.16 and will be removed in version 0.3.0, which will also cross-build for Spark 2 and Spark 3.
ADAM version 0.31.0 will be released soon, with deprecations from utils-metrics 0.2.16 and others. Then ADAM version 0.32.0 will depend on bdg-utils version 0.3.0 and will also cross-build for Spark 2 and Spark 3.
Michael L Heuer
@heuermh
If there are any other issues/pull requests folks would like to see triaged to the 0.31.0 release, please let me know
Karen Feng
@karenfeng
Awesome, thanks Michael. When would you ballpark the release for 0.32.0?
Michael L Heuer
@heuermh
Not exactly sure; bdg-utils 0.3.0 and ADAM 0.32.0 should probably wait for Spark 3.0 proper instead of the preview releases.
And I don't have access to any Spark 3 cluster environments for testing; e.g., AWS EMR is still at Spark 2.4.4
jamesthegiantpeach
@jamesthegiantpeach
How do you shard fastq bgz files without having an index? That is, how do you seek into a bgz fastq file and know that a block doesn't have half of a fastq record in it? Couldn't find the code in the repo
Michael L Heuer
@heuermh
Block-gzipped (bgz/bgzf) FASTQ files are supported by the code in this package https://github.com/bigdatagenomics/adam/tree/master/adam-core/src/main/java/org/bdgenomics/adam/io
Most of it came in on this commit bigdatagenomics/adam@985e5d8
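(The trick, roughly: every BGZF block starts with a fixed gzip header carrying a "BC" extra subfield, so a reader can seek to an arbitrary byte offset, scan forward to the next block boundary, decompress from there, and then skip ahead to the next record start. A minimal sketch of the header check, not the ADAM code, and assuming the BC subfield comes first in the extra field as htslib writes it:)

    // gzip magic (0x1f 0x8b), CM=8 (deflate), FLG=4 (FEXTRA), and the
    // BGZF "BC" subfield identifiers at offsets 12-13 of the block header
    def looksLikeBgzfBlockStart(buf: Array[Byte], i: Int): Boolean =
      buf(i) == 0x1f.toByte && buf(i + 1) == 0x8b.toByte &&
        buf(i + 2) == 0x08.toByte && buf(i + 3) == 0x04.toByte &&
        buf(i + 12) == 'B'.toByte && buf(i + 13) == 'C'.toByte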
Jashan Goyal
@jashangoyal09
Hi folks, how can I get files in Avro format from an S3 bucket?
Michael L Heuer
@heuermh
Hello @jashangoyal09, it depends on your use case. Would you be running Spark from AWS? on EMR?
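(In the meantime, a minimal sketch, assuming Spark with the hadoop-aws module on the classpath and AWS credentials configured; the bucket and path are made up. ADAM reads s3a:// URIs through the Hadoop filesystem layer like any other path.)

    import org.bdgenomics.adam.rdd.ADAMContext._

    // load an ADAM Parquet/Avro dataset directly from S3
    val alignments = sc.loadAlignments("s3a://your-bucket/sample.alignments.adam")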