Michael Adkins
@madkinsz
I think Spark uses Arrow already?
Pulling a file into memory using PySpark or something almost definitely uses Arrow to move it from Java -> Python
pauldwolfe
@pauldwolfe
thanks! the only docs I can find on that are indeed for pyspark (https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html), but even there it looks like it's an optional optimization?
Wonder what happens if you spark.conf.set("spark.sql.execution.arrow.enabled", "true") in a JVM only spark job
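A minimal sketch of what that flag actually does, assuming Spark 2.3/2.4 (the app name and path below are hypothetical): it only governs Arrow-based columnar transfer between the JVM and Python (toPandas(), pandas UDFs), so setting it in a JVM-only job is harmless but has no effect.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: the Arrow flag only affects PySpark's JVM <-> Python conversion
// paths; a pure-JVM job never exercises it.
val spark = SparkSession.builder()
  .appName("arrow-flag-demo") // hypothetical app name
  .getOrCreate()

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

// A plain JVM read/count never crosses the Python boundary, so Arrow is
// not involved here regardless of the flag above.
val df = spark.read.parquet("/path/to/data.parquet") // hypothetical path
println(df.count())
```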
Michael L Heuer
@heuermh
Arrow, Parquet, and Spark are all coming together slowly over time (e.g. parquet-cpp has been taken over by Arrow); I wouldn't think there'd be much reason to develop any genomics I/O for Arrow directly.
Plus doing so is soul-crushing suck, and I wouldn't wish it on any university students
pauldwolfe
@pauldwolfe
Ha! Ok clear! Was thinking I might put you guys in touch, but sounds like you've already thought this through :)
Michael L Heuer
@heuermh
If Spark overhead is too much for your use cases, our lab is also developing https://github.com/ray-project/ray and https://github.com/modin-project/modin
Michael Adkins
@madkinsz
Ray is super cool
Michael L Heuer
@heuermh
I've been working with a Google Summer of Code student looking at adding Parquet support to https://github.com/biod/sambamba, to improve single-node performance, but they've been a bit stuck in the weeds of Parquet support for the D language
pauldwolfe
@pauldwolfe
I'll have a look, thanks guys! This wasn't to solve any immediate problem for us, just some students we met looking to try their wares somewhere.
Michael L Heuer
@heuermh
Yep, Ray is super cool, to the extent that there are very few students in our lab working on Spark any more ;)
pauldwolfe
@pauldwolfe
definitely not the "coolest" but sambamba on parquet would be super-handy
Michael L Heuer
@heuermh
Yeah, ADAM can't touch sambamba's single-node performance; it's only at scale that we do better
I'll be meeting with the other GSoC mentors on that project at BOSC this week, will see if more progress has been made
pauldwolfe
@pauldwolfe
We still use it to index (.bai) our BAMs as well
Michael L Heuer
@heuermh
Have you looked at https://github.com/disq-bio/disq for that use case? I'm presenting Disq at BOSC; it is the replacement for Hadoop-BAM, is being used by GATK4 Spark now, and will most likely be used by ADAM in the near future.
As long as there isn't a regression in performance, that is. I've been working on some benchmarks for a comparison slide https://github.com/heuermh/benchmarks
I don't have a public branch for ADAM+Disq integration ready yet, though. There are some conflicts in Kryo serialization if you have both on the runtime classpath at the same time
pauldwolfe
@pauldwolfe
Nope, haven't really spent much time on that, been working more downstream the last little while. Will keep an eye on it, thanks!
Michael L Heuer
@heuermh
Next time I should just push instead of saying something's not ready yet ;)
pauldwolfe
@pauldwolfe
:)
Xin Wu
@alartin
Vote for Disq! Hi guys, I'm a bit confused: the ADAM paper points out the disadvantages of Hadoop-BAM, and yet ADAM has a dependency on Hadoop-BAM. If Hadoop-BAM did not work well, why does ADAM depend on it? Can someone help?
Michael L Heuer
@heuermh
Hello @alartin! Disq was started last year by ADAM developers and other collaborators specifically to address the shortcomings of Hadoop-BAM and provide a venue for further collaboration. At this point, Disq is an improvement on Hadoop-BAM on its own, but not necessarily better than Hadoop-BAM plus the workarounds, performance improvements, and additional features provided by ADAM.
Anton Kulaga
@antonkulaga
Disq cannot write via NIO, and that makes it very limiting
Michael L Heuer
@heuermh
@antonkulaga What is your use case, write via NIO to which filesystem(s)?
Anton Kulaga
@antonkulaga
@heuermh I need to give results to my colleagues and collaborators, who use neither Scala/Spark nor distributed file systems
Michael L Heuer
@heuermh
So writing from Spark to local disk via NIO?
Anton Kulaga
@antonkulaga
yes, they are used to dealing with BAM/CRAM and expect me to send results in those formats. Also, people outside the lab are not given access to our Spark cluster (it's a policy of my boss), so CRAM/BAM are the only ways I can send files to them. Sometimes Parquet (if squashed to one file with coalesce(1)) also works
there are also many bioinformatics tools that need BAMs as input, so to feed them pre-processed results I have to save to BAM
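A minimal sketch of that single-file Parquet export, assuming the ADAM Scala API of that era (loadAlignments / transform / saveAsParquet); the paths are hypothetical:

```scala
import org.apache.spark.SparkContext
import org.bdgenomics.adam.rdd.ADAMContext._

// Sketch: coalesce the reads to one partition before saving, so the
// Parquet output directory contains a single part file that can be
// handed to collaborators. Paths are hypothetical.
def exportSingleParquet(sc: SparkContext): Unit = {
  val reads = sc.loadAlignments("hdfs:///data/sample.bam")
  reads.transform(_.coalesce(1))
       .saveAsParquet("hdfs:///exports/sample.single.adam")
}
```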
Michael L Heuer
@heuermh
Right, I understand the need to write to native file formats. Writing them to local disk in ADAM+Hadoop-BAM or Disq via the Hadoop local file system seems to work reasonably well for me. The only place I've found where fast-concat doesn't work is our CDH-provided encrypted HDFS.
Would any of those downstream tools that require BAMs be good candidates for Cannoli/ADAM Pipe API? The more you can get done on the Spark cluster, the better ;)
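A minimal sketch of that local-disk BAM export, assuming the ADAM Scala API (naming varies slightly across releases) and that the file:// path is visible where the final concatenation runs; the paths are hypothetical:

```scala
import org.apache.spark.SparkContext
import org.bdgenomics.adam.rdd.ADAMContext._

// Sketch: sort the reads and write them out as a single coordinate-sorted
// BAM on the local filesystem via the Hadoop file:// scheme, for downstream
// tools that expect BAM input. Paths are hypothetical.
def exportLocalBam(sc: SparkContext): Unit = {
  val reads = sc.loadAlignments("hdfs:///data/sample.reads.adam")
  reads.sortReadsByReferencePosition()
       .saveAsSam("file:///shared/exports/sample.bam",
                  asSingleFile = true,
                  isSorted = true)
}
```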
pauldwolfe
@pauldwolfe
hey all, having some performance issues with markdups we were hoping to get some tips on. Basically we're processing a patient who failed our internal QC on the first pass, so our lab resequenced and combined the reads from both sequencing runs into one FASTQ motherload. About 576 gigabases
We use ADAM to align and markdups, then persist the BAM in a similar way to how @heuermh describes above
What happens is one task gets stuck in a GC tailspin, eventually times out and the whole job comes crashing down
I guess there is one region with super high duplicate read rate/coverage which is causing this
But a bit hard to tell before we've aligned :)
pauldwolfe
@pauldwolfe
Looking for any tuning tips; we're still on ADAM 0.26 and Spark 2.3.3, so perhaps an upgrade might help?
Unfortunately can't share the data causing the issue, privacy
Michael L Heuer
@heuermh
I haven't seen any performance improvements between Spark 2.3.x and Spark 2.4.x, not sure that would help much.
Michael L Heuer
@heuermh
If you have one large task or several large tasks, repartitioning may help. Adding a persist-to-Parquet-on-disk and read-back-from-Parquet step before and/or after alignment may also help. I haven't tested this, but it seems like it would be more efficient for Spark to repartition a Parquet dataset than something going through Hadoop-BAM code paths.
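A minimal sketch of that repartition-plus-Parquet-checkpoint idea, assuming the ADAM Scala API; the partition count and paths are hypothetical and would need tuning:

```scala
import org.apache.spark.SparkContext
import org.bdgenomics.adam.rdd.ADAMContext._

// Sketch: checkpoint the aligned reads to Parquet, reload, and repartition
// so no single task carries a disproportionate share of the data into
// duplicate marking. Partition count and paths are hypothetical.
def checkpointAndRepartition(sc: SparkContext): Unit = {
  val aligned = sc.loadAlignments("hdfs:///work/sample.aligned.adam")
  aligned.saveAsParquet("hdfs:///work/sample.checkpoint.adam")

  val rebalanced = sc.loadAlignments("hdfs:///work/sample.checkpoint.adam")
    .transform(_.repartition(4096)) // tune to cluster and data size
  rebalanced.markDuplicates()
            .saveAsParquet("hdfs:///work/sample.markdup.adam")
}
```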
For markdup, pull request bigdatagenomics/adam#2045 had better performance in some but not all performance test cases (that is why it has not been merged). It may work well in your case.
pauldwolfe
@pauldwolfe
Thanks @heuermh. We do have a large dataset with a lot of dups. Actually, in the end we ran it through sambamba just to get it processed, and found it fails our own coverage QC.
What's the plan for bigdatagenomics/adam#2045 ? Would prefer to always depend on an official ADAM release
(although don't mind giving it a try to see if does help)
Michael L Heuer
@heuermh
Jon gave it a few rounds of tries, but its performance wasn’t always better than the current implementation, due to different shuffle characteristics. It was generally better on single node but scaled much worse.
It wouldn’t be too difficult to rebase the PR to keep trying, I can do that when I get back from the lake later this week.
And the closer your project can track git HEAD, the more folks I have keeping me honest with respect to refactoring towards version 1.0, the better :)
Michael L Heuer
@heuermh
All, we're planning on cutting a 0.29.0 release tomorrow. The only issue left in the 0.29.0 milestone is regarding an htsjdk 2.19.x to 2.20.x update, though it looks like that may require a new release of Hadoop-BAM, and so is likely to be deferred to a later ADAM release. Please let me know if there are any other issues that you'd like us to try to get in before then.