    ChristopherWilks
    @ChristopherWilks
    @noemidi91, @nellore is correct: the only way to do everything is to use the junctions.bgz for a particular compilation. Any other approach is going to be either study-specific (as in recount3) or heavily filtered (snapcount). Ultimately this is because the junction matrices for a compilation like srav3 are massive (228,000,000 x 315,000 if fully materialized as dense matrices) and are still quite large even if stored as sparse.
    Also, while the sqlite3 databases are available for most/all compilations (copies of what's in junctions.bgz in a relational format), for the very large compilations in recount3 (srav3, tcgav2, and gtexv2) queries tend to take a very long time when run on them, so it's best to use the bgzip-formatted files in a streaming mode, as @nellore suggested, to avoid that.
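For concreteness, the streaming mode suggested above looks something like the sketch below. The data URL follows the srav3h file mentioned later in this chat; bgzip output is gzip-compatible, so plain zcat can decompress it on the fly. A toy stand-in file keeps the example runnable offline.

```shell
# Real use streams straight from the server without materializing anything:
#   curl -s "http://snaptron.cs.jhu.edu/data/srav3h/junctions.bgz" \
#     | zcat | awk -F'\t' '$2 == "chr8"' | head
# Toy stand-in: since bgzip output is gzip-compatible, a gzipped file
# behaves the same way for reading.
printf '0\tchr8\t100\t200\n1\tchr1\t5\t9\n' | gzip > /tmp/toy.junctions.bgz
zcat /tmp/toy.junctions.bgz | awk -F'\t' '$2 == "chr8"'
```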
    noemidi91
    @noemidi91
    thanks a lot!
    ChristopherWilks
    @ChristopherWilks
    just a heads up to anyone using the public Snaptron service(s), we're taking it offline temporarily while we work through the implications of potentially being affected by the recent Log4j vulnerability https://www.randori.com/blog/cve-2021-44228/
    Anna-Leigh Brown
    @aleighbrown
    What are the columns for "junctions.bgz"?
    0  ERCC-00002  27   85   59  +  0  GC  AG  0  0  ,667693:31  1  31  31  31  0
    1  ERCC-00002  28   92   65  -  0  CT  AC  0  0  ,796958:1   1   1   1   1  0
    2  ERCC-00002  28  100   73  -  0  CT  AC  0  0  ,2105462:5  1   5   5   5  0
    3  ERCC-00002  28  381  354  -  0  CT  AC  0  0  ,1070959:3,1082155:2,2165778:3,2167060:5,922139:5
    oh that didn't format correctly at all
    but that's just calling head on it
    abhinav
    @nellore
    @aleighbrown they line up with the fields described here, starting with the second field: http://snaptron.cs.jhu.edu/reftables.html#table-5-complete-list-of-snaptron-fields-in-return-format
    ERCC-* refers to a spike-in, as described here: https://www.thermofisher.com/order/catalog/product/4456740#/4456740
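A quick way to line the head output up with those fields is to print each column next to its name. The names below are my transcription of table 5 on reftables.html (offset by one, as @nellore notes, since the raw file omits the first field); double-check them against the page.

```shell
# Label each column of a junctions.bgz line with its (assumed) field name.
names="snaptron_id chromosome start end length strand annotated left_motif right_motif left_annotated right_annotated samples samples_count coverage_sum coverage_avg coverage_median source_dataset_id"
printf '0\tERCC-00002\t27\t85\t59\t+\t0\tGC\tAG\t0\t0\t,667693:31\t1\t31\t31\t31\t0\n' \
  | awk -F'\t' -v names="$names" 'BEGIN { split(names, n, " ") }
      { for (i = 1; i <= NF; i++) printf "%s\t%s\n", n[i], $i }'
```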
    Anna-Leigh Brown
    @aleighbrown
    thanks!
    Etienne Kornobis
    @khourhin
    Hello, I would like to use the Snaptron web interface to identify exon-exon junctions for the GPHN gene in mouse. I did not find an example curl URL for this in the manual. Is it feasible to make mouse queries through the API?
    abhinav
    @nellore

    @khourhin try this variant of an example:

    curl "http://snaptron.cs.jhu.edu/srav1m/snaptron?regions=GPHN&rfilter=samples_count>:100&rfilter=annotated:1"

    that is, the compilation you're looking for is srav1m

    Etienne Kornobis
    @khourhin
    Thanks a lot, I'll try !
    ChristopherWilks
    @ChristopherWilks
    thanks @nellore for responding!
    abhinav
    @nellore
    np!
    Anna-Leigh Brown
    @aleighbrown
    Sorry to ask again. Regarding this file, http://snaptron.cs.jhu.edu/reftables.html#table-5-complete-list-of-snaptron-fields-in-return-format, are the start and end coordinates 0-based or 1-based?
    Anna-Leigh Brown
    @aleighbrown
    one more q - what's not clear to me from the snapcount R package is whether you can query for a specific event and get the samples containing that event. E.g. I can query for a specific splice junction with my_junc <- QueryBuilder(compilation = 'srav2', regions = 'chr8:79611215-79616821'), but you get back all the events that also overlap
    you can then look at the samples with those overlaps in the colData (my_junc@colData), but it's not obvious how to get the samples that had the one range I was looking for out of the 30 that were returned.
    Maybe there's a way to retrieve samples containing a snaptron_id?
    Anna-Leigh Brown
    @aleighbrown

    OO - figured it out. I'd recommend including this in the vignette, as IMO the function names and order of operations seemed a bit unclear:
    query <- QueryBuilder(compilation = 'srav3h', regions = 'chr8:79611215-79616821')
    query <- set_row_filters(query, snaptron_id == 72542235)
    my_juncs <- query_jx(query)

    So you can get the exact snaptron id and set a filter afterwards

    or ofc just set the row filters on exactly the coordinates required
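For anyone without R handy, the same lookup can likely be done against the web API directly. This is an untested sketch: the rfilter equality syntax (`field:value`) is assumed by analogy with the `rfilter=annotated:1` curl example earlier in this chat, so verify it against reftables.html.

```shell
# Build the single-junction query URL and print it; run it with:
#   curl "$url"
# snaptron_id filter syntax is an assumption, by analogy with rfilter=annotated:1.
url="http://snaptron.cs.jhu.edu/srav3h/snaptron?regions=chr8:79611215-79616821&rfilter=snaptron_id:72542235"
echo "$url"
```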
    abhinav
    @nellore

    @aleighbrown table 1 of http://snaptron.cs.jhu.edu/reftables.html says region coordinates are 1-based in a query; this is also true of returned coordinates.

    a more straightforward way to perform a single-junction query can perhaps be inferred from p. 4 of http://www.bioconductor.org/packages/release/bioc/manuals/snapcount/man/snapcount.pdf -- see the Exact field.

    thanks for using!

    hannahjacobs8
    @hannahjacobs8
    Hi! Just to be clear, in the file "http://snaptron.cs.jhu.edu/data/srav3h/junctions.bgz", for the "annotated" column, is 0 annotated and 1 unannotated?
    ChristopherWilks
    @ChristopherWilks
    hi @hannahjacobs8, 0 is unannotated and 1 is annotated, is there documentation where that's contradicted (that we need to change)?
    hannahjacobs8
    @hannahjacobs8
    Oh, it's just that when looking at that file, I thought I saw more 0's than 1's in that column, so I thought it might have been flipped, because your paper mentions that only 16% of junctions were found to be unannotated
    ChristopherWilks
    @ChristopherWilks
    ahh, yes, Fig. 2 (and the related paragraph). That was after requiring that the jx appear in >=5% of SRA run accessions in the organism before looking at annotated vs. unannotated, i.e. if you're looking at all jx's w/o filtering (as in that file), there are a lot of unannotated jx's which appear in only a few samples (or only one sample); those are probably noise (though not absolutely)
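That prefilter is easy to approximate on the raw file. A sketch, assuming samples_count is column 13 (per the field list linked above) and with a made-up total-run count N; toy rows inline so it runs standalone.

```shell
# Real use: zcat junctions.bgz | awk -F'\t' -v n="$N" '$13 >= 0.05 * n'
N=100   # placeholder, NOT the real compilation size
printf 'a\tchr1\t1\t2\t1\t+\t1\tGT\tAG\t0\t0\t,s\t7\t9\t1\t1\t0\nb\tchr1\t3\t4\t1\t+\t0\tGT\tAG\t0\t0\t,s\t2\t2\t1\t1\t0\n' \
  | awk -F'\t' -v n="$N" '$13 >= 0.05 * n'
```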
    hannahjacobs8
    @hannahjacobs8
    Ohh right that makes sense now. I think I will do the same filtering then. Thanks!
    noemidi91
    @noemidi91
    Hi, I have a very naive question: the read count spanning a given junction is obtained from STAR, right? Is it re-processed/filtered somehow by recount3? Thank you!
    ChristopherWilks
    @ChristopherWilks
    @noemidi91 yes, it's from STAR's own filtered list of junction calls; as per the STAR manual, "SJ.out.tab contains high confidence collapsed splice junctions in tab-delimited format". We did not do any additional filtering.
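For reference, the STAR manual defines SJ.out.tab's nine tab-separated columns, and a common post-hoc filter is on the unique-read count (column 7). Toy rows inline so the sketch runs standalone.

```shell
# SJ.out.tab columns (per the STAR manual): 1 chrom, 2 intron start (1-based),
# 3 intron end, 4 strand (0 undefined, 1 +, 2 -), 5 intron motif code,
# 6 annotated (0/1), 7 uniquely-mapping reads, 8 multi-mapping reads,
# 9 max spliced-alignment overhang.
# Keep junctions supported by >= 5 uniquely mapped reads:
printf 'chr1\t100\t200\t1\t1\t1\t12\t3\t40\nchr1\t300\t400\t2\t2\t0\t2\t0\t25\n' \
  | awk -F'\t' '$7 >= 5'
```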
    abhinav
    @nellore
    @ChristopherWilks 1-pass or 2-pass?
    ChristopherWilks
    @ChristopherWilks
    @nellore 1-pass only---we didn't run 2-pass at all in Monorail, though we considered it. IIRC running 2-pass dynamically (w/o a pre-determined set of jxns, which we were trying to avoid) would've forced us to load non-shared, per-sample copies of the reference index, which would've significantly decreased the number of samples we were able to run on the same machine due to memory constraints.
    Ramsey Kamar
    @RamseyKamar
    Hi @ChristopherWilks or @nellore . I'm only using the part of the Unifier that creates the minimal files for a Snaptron instance (i.e. I set SKIP_SUMS=1). Do you have a tool for combining the outputs from multiple Unifier runs into one? So is there a way I could generate a single set of junctions.bgz, junctions.bgz.tbi, junctions.sqlite, samples.tsv, samples.fields.tsv, lucene_full_standard, lucene_full_ws, and lucene_indexed_numeric_types.tsv from two separate runs? I'm asking because my group is interested in Unifying SJ.outs which live on separate servers on opposite sides of the Atlantic. If we could avoid transferring all the SJ.out files from one server to the other that would be nice (i.e. it would be nice to run Unifier separately on each side of the Atlantic and combine the results). Thanks!
    Ramsey Kamar
    @RamseyKamar
    I suspect this is possible since I doubt you rerun the Unifier on all samples when you get access to new samples for a compilation.
    abhinav
    @nellore
    so i don't know if that was ever implemented (we'll wait for @ChristopherWilks for a definitive answer), but i used to merge junctions.bgz-like files in the days of recount2; i'd just cat the multiple files to GNU coreutils sort to sort by junction (i believe this will be -k3,3 -k4,4n -k5,5n -k7,7 for junctions.bgz), then pipe the result to a script that merged lines corresponding to the same junction. should be possible to obtain the other database files starting there. use --parallel=n to use n threads when sorting
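A toy version of that recipe, with the sort keys copied from @nellore's message; the per-junction merge script itself is hypothetical and left out.

```shell
# Two tiny fake junction files standing in for the per-server dumps;
# real use would zcat the junctions.bgz files instead of cat.
tab="$(printf '\t')"
printf '1\tx\tchr2\t100\t200\t50\t+\n' > /tmp/a.tsv
printf '2\tx\tchr1\t100\t200\t50\t+\n3\tx\tchr2\t100\t150\t30\t-\n' > /tmp/b.tsv
# Sort by junction key (-k3,3 -k4,4n -k5,5n -k7,7 per @nellore); a merge step
# would then collapse adjacent lines for the same junction:
cat /tmp/a.tsv /tmp/b.tsv \
  | sort -t "$tab" -k3,3 -k4,4n -k5,5n -k7,7 --parallel=2
#  | ./merge_same_junction_lines   # hypothetical merge script, not provided
```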
    Ramsey Kamar
    @RamseyKamar
    @nellore I'm seeing now that maybe these tools do exist based on the README of your internal version of the Unifier (https://github.com/langmead-lab/recount-unify). But I suppose I would need like a docker image with all the necessary tools pre-built (especially Lucene!) to get it to work....and maybe a little instruction on how to run it :)
    ChristopherWilks
    @ChristopherWilks
    @RamseyKamar the general outline that @nellore described is essentially what you'll need to do. That said, a lot of this is already handled in what exists (as you stated in your response) but will require some re-working from you. Mainly, you'll need to stop the main jxn unify process before it does the final merge + annotation run for each set of SJ.out files, and then run that step for all the partially merged sets together on one machine. You should treat the initial merges of SJ.out files as coming from >=2 separate studies for this purpose---whether they're actually different or not doesn't matter as long as you ensure there are no duplicates in the sample rail_ids between the various sets of SJ.out files, i.e. the sample rail_id space must be disjoint so that you can safely merge them all together at the end w/o the same rail_id referring to two distinct samples.
    ChristopherWilks
    @ChristopherWilks
    actually, it's easier to let the full unify process run for each set of SJ.out files (geographically separated, as you said), then extract the intermediate *.sjs.merged.motifs files from the unify run directory of each of the 2 sets, copy those to one machine so you have them together, then re-run the final steps of the jxn unify pipeline to get the fully merged set of jxns.
    the steps are: collect_per_study_sjs=>merge_per_study_sjs=>collect_study_merged_sjs=>merge_all_sjs=>annotate_all_sjs (final step produces the full set of merged jxns with motifs and annotations)
    it's a bit heavyweight since it was designed to merge across many studies at once so there's room for efficiency shortcuts, but I'd start with that process first since there are certain formatting assumptions that may be made across steps.
    ChristopherWilks
    @ChristopherWilks
    also, IIRC I did this w/o using Snakemake at some point (i.e. the efficiency shortcuts mentioned above), so as an alternative, check out this script, which was built to do what you asked, more or less (use caution though: you may need to modify it appropriately, so make sure you read through it and understand it thoroughly before applying it, it isn't plug-n-play): https://github.com/langmead-lab/recount-unify/blob/master/merge/super_jx_merge_for_snaptron.sh
    in your case, "tranche" would be each of the 2 geographically separate sets of SJ.out files run through the full unifier process already
    Ramsey Kamar
    @RamseyKamar
    Thank you @ChristopherWilks and @nellore ! Let me chew on this, see if I can figure it out, and get back to you.
    David McGaughey
    @davemcg
    Hi @ChristopherWilks (and company) - I'm curious whether I can use "snapcount" with RSEs imported with recount3 (e.g. via rse_jx <- create_rse_manual('SRP002881', type = 'jxn', annotation = 'gencode_v29'))?
    David McGaughey
    @davemcg
    Hmm, so, the answer is "no, not directly." It appears that snaptron query_jx returns extra (and important!) columns that have aggregated info not present in the recount3 RSE: samples_count, coverage_sum, coverage_avg, coverage_median, source_dataset
    David McGaughey
    @davemcg
    I apologize, just discovered that unifier creates the files: https://github.com/langmead-lab/monorail-external#unifier-outputs
    Of course now I need to figure out how to link the unifier outputs into snapcount....I assume this is possible?
    David McGaughey
    @davemcg
    ah, I need to set up a docker / snaptron instance...hmm, this is heavier than I'd like....(I do understand why this is desired for your use case)
    ChristopherWilks
    @ChristopherWilks
    Hi @davemcg, sorry for the tardy response, as you found, we didn't originally intend to link recount3 jxn RSE's and snapcount methods directly (though it's reasonable to think they could be linked).
    Also, as you're finding, snapcount needs a Snaptron server instance to query against, which it sounds like for your case may be overkill. I'm assuming you'd just like to directly load up the jxn files into a snapcount RSE?
    I would be interested in hearing more detail about your use case, either here or you can email me (broadsword@gmail.com). No guarantees about providing a solution at this point, but it could inform future updates.