Michael L Heuer
@heuermh
ADAM is reading from BAM files split/partitioned across executor nodes, so any accumulated byte counts are only going to be in the context of a single partition.
Michael L Heuer
@heuermh
I don't know that much about what htsjdk's support for .BAI files looks like; I'll have to dig
Roman Valls Guimera
@brainstorm
Yeah, essentially the voffsets are to be paired with genomic coordinates; see here for a more straightforward and low-level explanation: https://delagoya.github.io/rusty-bio/binary-file-parsing-in-bioinformatics
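For context, a BGZF virtual file offset packs both numbers into a single 64-bit value, per the SAM spec: the compressed byte offset of the containing BGZF block in the upper 48 bits, and the uncompressed offset within that block in the lower 16 bits. A minimal Scala sketch of the decomposition:

def splitVirtualOffset(voffset: Long): (Long, Int) = {
  val coffset = voffset >>> 16            // compressed offset of the BGZF block start
  val uoffset = (voffset & 0xffff).toInt  // uncompressed offset within the block
  (coffset, uoffset)
}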
Roman Valls Guimera
@brainstorm
@heuermh Would it be acceptable to modify the bdg.avdl to include a couple of fields such as coffset/uoffset? I would like to avoid chucking those into the additional attributes: https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl#L446
Roman Valls Guimera
@brainstorm
(acceptable as a pull request, that is)
Michael L Heuer
@heuermh
I guess I'm on the fence, since those fields are specific to a flat-file implementation that frankly we'd like to see go away. If an argument can be made that we need to keep them around in order to properly round-trip, and ADAM reads and writes them in the course of "normal business", then I might be convinced.
Perhaps you might discuss the use case more? If alignment records are in Parquet on disk and you need quick access for region-based queries, we've pursued those use cases with IntervalRDD in Mango, and by writing Parquet partitioned by region, e.g. AlignmentDataset.saveAsPartitionedParquet(partitionSize = 100000)
Roman Valls Guimera
@brainstorm
Thanks Michael for the pointer and explanation. I do agree, and had thought about the inconvenience of putting those low-level details in (even as an optional field)... the main use case would be https://samtools.github.io/hts-specs/htsget.html, where a query for a genomic coordinate range should return a JSON with matching reads and the accompanying byte ranges in the original file.
I can see how Mango encompasses a lot of the fast indexing/interval-tree ideas that htsget needs on the backend, so that's a big win/realization I did not have before... I'll explore Mango a bit more and see if there's a chance to expose those intervals as an htsget-like REST API.
JavaAndPython55
@JavaAndPython55
Regarding the problem that ADAM cannot be installed from within China: the build fails with Failed to execute goal on project adam-core-spark2_2.11: Could not resolve dependencies for project org.bdgenomics.adam:adam-core-spark2_2.11:jar:0.30.0-SNAPSHOT: The following artifacts could not be resolved: org.apache.avro:avro-mapred:jar:hadoop2:1.8.2. It would help to have a download source hosted in China. If you send me the files, I can provide a compiled download source for ADAM's Maven artifacts in China, for use by ADAM users there and to promote the spread of ADAM in China.
jamesthegiantpeach
@jamesthegiantpeach
Anyone have an example of using ADAM to bin reads and run HaplotypeCaller over the bins? I want to run some avocado benchmarks against HaplotypeCaller, but I'd rather not use their Spark implementation. Looking to just bin reads and run the single-threaded version on each of the bins
Michael L Heuer
@heuermh
You can use ADAM to write out Hive-style region-partitioned Parquet, and then load by region and write to BAM, but I imagine that might not be any more efficient than pulling from an indexed BAM:
// write alignments as Hive-style region-partitioned Parquet
val alignments = sc.loadAlignments("sample.bam")
alignments.saveAsPartitionedParquet("sample.partitioned.adam")
// load back only the partitions overlapping the requested regions
val bin = sc.loadPartitionedParquetAlignments("sample.partitioned.adam", regions = Seq(...))
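The elided regions argument takes ReferenceRegions; for illustration, a hedged sketch with a hypothetical contig and coordinates:

import org.bdgenomics.adam.models.ReferenceRegion

val regions = Seq(ReferenceRegion("1", 0L, 100000L)) // contig "1", first 100 kbp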
jamesthegiantpeach
@jamesthegiantpeach
Hmm, I think that almost works. I would need to figure out how to "slop" those partitions, similar to this: https://bedtools.readthedocs.io/en/latest/content/tools/slop.html
Michael L Heuer
@heuermh
That is how our pipe APIs work; flankSize is the slop between partitions sent to external commands.
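For illustration, a minimal sketch of a pipe call with slop (the type parameters and package locations follow the post-rename class names used elsewhere in this chat and may differ by ADAM version; my_tool is a hypothetical command reading and writing SAM on stdin/stdout):

import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.read.{ AlignmentDataset, SAMInFormatter }
import org.bdgenomics.adam.sql.{ Alignment => AlignmentProduct }
import org.bdgenomics.formats.avro.Alignment

val reads = sc.loadAlignments("sample.bam")
val piped: AlignmentDataset = reads.pipe[Alignment, AlignmentProduct, AlignmentDataset, SAMInFormatter](
  cmd = Seq("my_tool", "-"),
  flankSize = 1000) // bases of overlap ("slop") copied into adjacent partitions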
jamesthegiantpeach
@jamesthegiantpeach
OH! Ok I just need to familiarize myself with the API more. Thanks
Michael L Heuer
@heuermh
The partitions in Hive-style region-partitioned Parquet are the foldered partitions on disk, not necessarily how Spark might partition; overlapping terminology, unfortunately.
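Roughly, the on-disk layout looks like this (a sketch; ADAM partitions by reference name and a position bin derived from partitionSize, and the exact field names may vary by version):

sample.partitioned.adam/
  referenceName=1/
    positionBin=0/
      part-00000.parquet
    positionBin=1/
      part-00001.parquet
  referenceName=2/
    ...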
jamesthegiantpeach
@jamesthegiantpeach

I sometimes get this error when running bwa and I'm not sure why:

Caused by: java.util.NoSuchElementException: key not found: sample
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:59)
    at scala.collection.MapLike$class.apply(MapLike.scala:141)
    at scala.collection.AbstractMap.apply(Map.scala:59)
    at org.bdgenomics.adam.models.ReadGroupDictionary.apply(ReadGroupDictionary.scala:133)
    at org.bdgenomics.adam.converters.AlignmentConverter$$anonfun$convert$1$$anonfun$apply$1.apply(AlignmentConverter.scala:259)
    at org.bdgenomics.adam.converters.AlignmentConverter$$anonfun$convert$1$$anonfun$apply$1.apply(AlignmentConverter.scala:257)

It comes from this: mid-alignment, the read group map doesn't contain the assigned sample name.

Just a local test.

        val readFile1 = "/home/james/genomic_data/reads/test_1.fq"
        val readFile2 = "/home/james/genomic_data/reads/test_2.fq"
        val ref       = "/home/james/genomic_data/reference/human_g1k_v37.fasta"

        val args = new BwaArgs()
        args.indexPath = ref
        args.sample = "sample"

        val reads = sc.loadPairedFastqAsFragments(readFile1, readFile2)
        val alignments = reads.alignWithBwa(args)
        alignments.saveAsSam("/home/james/genomic_data/reads/test_1.bam", asSingleFile = true)
Michael L Heuer
@heuermh
Further up the stack trace, is this exception thrown during saveAsSam? I believe the alignWithBwa step is probably going fine, the problem may be that alignments does not have any read groups when trying to convert to BAM.
The cannoli bwa command line class does two extra steps after piping out to bwa, creating read groups and references
https://github.com/bigdatagenomics/cannoli/blob/master/cli/src/main/scala/org/bdgenomics/cannoli/cli/Bwa.scala#L102
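As a workaround, something like the following before writing might do it (a hedged sketch: replaceReadGroups and the ReadGroup(sampleId, id) constructor are assumed here and may differ by version):

import org.bdgenomics.adam.models.{ ReadGroup, ReadGroupDictionary }

// attach a read group whose id matches args.sample before converting to BAM
val withReadGroups = alignments.replaceReadGroups(
  ReadGroupDictionary(Seq(ReadGroup("sample", "sample"))))
withReadGroups.saveAsSam("/home/james/genomic_data/reads/test_1.bam", asSingleFile = true)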
jamesthegiantpeach
@jamesthegiantpeach

Interesting error

Exception in thread "main" java.lang.IllegalArgumentException: Fastq 1 (s3a://bgi-concordance-studies/run190704NovaSeq_R1_A501.fastq.gz.corrected.fastq.err_barcode_removed.fixed.fastq.gz) has 1023782058 reads, fastq 2 (s3a://bgi-concordance-studies/run190704NovaSeq_R2_A501.fastq.gz.corrected.fastq.err_barcode_removed.fixed.fastq.gz) has 1023782057 reads
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadPairedFastq$1.apply(ADAMContext.scala:2536)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadPairedFastq$1.apply(ADAMContext.scala:2513)
    at scala.Option.fold(Option.scala:158)
    at org.apache.spark.rdd.Timer.time(Timer.scala:48)
    at org.bdgenomics.adam.rdd.ADAMContext.loadPairedFastq(ADAMContext.scala:2513)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadPairedFastqAsFragments$1.apply(ADAMContext.scala:3059)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadPairedFastqAsFragments$1.apply(ADAMContext.scala:3059)
    at scala.Option.fold(Option.scala:158)
    at org.apache.spark.rdd.Timer.time(Timer.scala:48)
    at org.bdgenomics.adam.rdd.ADAMContext.loadPairedFastqAsFragments(ADAMContext.scala:3057)
    at org.myome.bigbioinf.Main$.main(Main.scala:39)
    at org.myome.bigbioinf.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The true number of reads is 1023782052

jamesthegiantpeach
@jamesthegiantpeach

https://github.com/bigdatagenomics/adam/blob/dd99cad756ec486efdc896cb1cf7d5080a9c95dd/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala#L2533

Somehow this conditional is failing even though I have confirmed that the numbers of reads are the same

jamesthegiantpeach
@jamesthegiantpeach

@A00741:47:HCM53DRXX:1:1101:1072:1000/1

Example barcode

Not sure if the barcode is causing a miscount or what
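One way to double-check the counts independently of loadPairedFastq (a hedged diagnostic sketch; fastq1Path/fastq2Path stand in for the s3a URLs from the error above, and BGZF data is valid gzip, so Spark can decompress it, though not split it):

// each FASTQ record is exactly 4 lines
val count1 = sc.textFile(fastq1Path).count() / 4
val count2 = sc.textFile(fastq2Path).count() / 4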
Karen Feng
@karenfeng
When can we anticipate an ADAM release on Spark 3?
Michael L Heuer
@heuermh
Well, Spark 3 isn't out yet ;)
Michael L Heuer
@heuermh
The big problem is the utils-metrics module, which is deeply dependent on internal Spark classes. In the pull request above, I started down the path of removing the ADAM dependency on utils-metrics, but that is a big effort, and losing metrics would be unfortunate
Michael L Heuer
@heuermh
As an update, the utils-metrics module was deprecated in release 0.2.16 and will be removed in version 0.3.0, which will also cross-build Spark 2 and Spark 3.
ADAM version 0.31.0 will be released soon, with deprecations from utils-metrics 0.2.16 and others. Then ADAM version 0.32.0 will depend on utils version 0.3.0 and will also cross-build Spark 2 and Spark 3.
Michael L Heuer
@heuermh
If there are any other issues/pull requests folks would like to see triaged to the 0.31.0 release, please let me know
Karen Feng
@karenfeng
Awesome, thanks Michael. When would you ballpark the release for 0.32.0?
Michael L Heuer
@heuermh
Not exactly sure; bdg-utils 0.3.0 and ADAM 0.32.0 should probably wait for Spark 3.0 proper instead of the preview releases.
And I don't have access to any Spark 3 cluster environments for testing; e.g., AWS EMR is still on Spark 2.4.4
jamesthegiantpeach
@jamesthegiantpeach
How do you shard fastq bgz files without having an index? That is, how do you seek into a bgz fastq file and know that a block doesn't have half of a fastq record in it? Couldn't find the code in the repo
Michael L Heuer
@heuermh
Block-gzipped (bgz/bgzf) FASTQ files are supported by the code in this package https://github.com/bigdatagenomics/adam/tree/master/adam-core/src/main/java/org/bdgenomics/adam/io
Most of it came in on this commit bigdatagenomics/adam@985e5d8
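The key trick is that a worker assigned an arbitrary byte range can scan forward for the fixed bytes of a BGZF block header and start decompressing there, then find FASTQ record boundaries by line structure in the decompressed stream. A minimal sketch of the header check (field layout from the SAM spec, not ADAM's actual code):

def looksLikeBgzfBlockStart(b: Array[Byte], i: Int): Boolean =
  b(i) == 0x1f.toByte && b(i + 1) == 0x8b.toByte &&    // gzip magic
  b(i + 2) == 8.toByte && b(i + 3) == 4.toByte &&      // deflate, FEXTRA flag set
  b(i + 12) == 'B'.toByte && b(i + 13) == 'C'.toByte   // BGZF "BC" extra subfield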
Jashan Goyal
@jashangoyal09
Hi folks, how can I get files in Avro format from an S3 bucket?
Michael L Heuer
@heuermh
Hello @jashangoyal09, it depends on your use case. Would you be running Spark from AWS? on EMR?
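If you are running Spark with the s3a filesystem, for example, loading looks like this (a hedged sketch, assuming hadoop-aws and a matching AWS SDK on the classpath with credentials configured; the bucket and path are hypothetical):

import org.bdgenomics.adam.rdd.ADAMContext._

// ADAM stores datasets as Parquet with Avro-derived schemas
val alignments = sc.loadAlignments("s3a://my-bucket/sample.alignments.adam")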
Larry N. Singh
@larryns
Hello, I'm trying to join an AlignmentDataset to a DataFrame. I've tried converting the AlignmentDataset to a DF and joining, but I don't see an obvious way of converting the DataFrame back to an AlignmentDataset so I can save it as a BAM. Alternatively, if I can just join the AlignmentDataset to the DataFrame and get an AlignmentDataset back, that would work too.
I looked at transformDataset and joining with that, but the conversion isn't quite right and I'm not sure if this is the best way. Any advice?
Michael L Heuer
@heuermh

In pseudocode, the transform would look like:

import org.bdgenomics.adam.sql.Alignment

val other = ... // some DataFrame
val alignments = sc.loadAlignments("sample.bam")
val joined = alignments.transformDataset(
  ds => ds.toDF().join(other).as[Alignment]
)

Another option is to use the AlignmentDataset constructor:

import org.bdgenomics.adam.sql.Alignment

val other = ... // some DataFrame
val alignments = sc.loadAlignments("sample.bam")
val joinedDataframe = alignments.toDF().join(other)
val joinedDataset = joinedDataframe.as[Alignment]
val joinedAlignments = AlignmentDataset(
  joinedDataset,
  alignments.sequences,
  alignments.readGroups,
  alignments.processingSteps  
)
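One caveat on the sketches above: join(other) with no condition is a cross join, so in practice you would join on a key column, e.g. (assuming other has a readName column):

val joinedDataframe = alignments.toDF().join(other, Seq("readName"))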
Larry N. Singh
@larryns
@heuermh, thanks for the pseudocode. What you wrote is basically what I ended up writing, so it's good to have confirmation that I did it the right way.
Thanks again!
Michael L Heuer
@heuermh
Good to hear!