Toni Westbrook
@twestbrookunh
Jordan, what version of the similarity script should I be using for BWA aligned against SwissProt?
Jordan Ramsdell
@Vadavim

So everything should work fine for the uniref90 sam files. I had to tweak listMapped because the indexes were a little off after splitting. I ran through align.sam and it seemed to go okay. The script is in /home/genome/airjordan/bin/python_libs/similarity_tester.py

Ahhh, as for SwissProt... I think before I changed the script, it could be used on that, but I had to change the way I parse the lines for the Uniref samples (and made some tweaks in listMapped), so I don't think it'll work anymore on the older data.

I can make a separate version that can work for those. Do you have the filepath for a sam file I can test on?

Toni Westbrook
@twestbrookunh
Sounds good, I'll try the script out on the UniRef SAM soon. For the BWA SwissProt, an example file can be found here: /home/genome/anthony/local_projects/paladin/test-bwa/bwa-finished_uniprotkb.fna-k19-T30-B1-O6-E1-L6.sam - would it be possible to have 4 copies of the script (named differently) for PALADIN+SwissProt, PALADIN+UniRef, BWA+SwissProt, and BWA+UniRef? I'm starting to run a bunch more tests back and forth, so that would be helpful - let me know if that's doable. Thanks!
Jordan Ramsdell
@Vadavim
Sure, it's doable. I've just been in the midst of getting ready for a midterm, but I'll get to it as soon as I can.
Toni Westbrook
@twestbrookunh
I have the UniRef-90 results back (indexing finished up late last night) - very interesting results. The mapped percentage was 97.45%, which is only a fraction under what we get aligning against the original references (97.68%). Additionally, the top 6 representative organisms from the aligned clusters are our exact test set - ecoli, pseudomonas, acidovorax, micrococcus, halobacillus, and staphylococcus (and all the other high matches are extremely close phylogenetically to those 6). So, it looks like we have a sweet spot with UniRef-90 where the clustering reduces the reference size to something we can index, but 90% similarity creates narrow enough clusters for us to get more accurate results. I'm going to proceed with using this reference (unless anyone thinks otherwise) to run the tests we discussed in yesterday's meeting. Thanks -
Matt MacManes
@macmanes
awesome news!
Jordan Ramsdell
@Vadavim

So I think the similarity_test script should be all set up now to work with uniref/swissprot and bwa/paladin. I've tested it with BWA+Swissprot, Paladin+Swissprot, and Paladin+Uniref90. Is there a BWA-Uniref90 sample somewhere? It should work fine on it, though.

The script now takes two arguments. The first is the type of mapper ("paladin" or "bwa") and the second is the sam file to parse through. I've also added usage information just in case.

Anywho, you can still find it in the same spot: /home/genome/airjordan/bin/python_libs/similarity_tester.py
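For reference, a minimal sketch of the two-argument interface as described above - mapper type first, then the SAM file. The usage string and validation details here are assumptions for illustration, not the real script's code:

```python
import sys

MAPPERS = ("paladin", "bwa")

def parse_args(argv):
    """Validate the described interface: mapper type first, SAM file second.
    Returns (mapper, sam_path), or None after printing usage information."""
    if len(argv) != 2 or argv[0].lower() not in MAPPERS:
        print("usage: similarity_tester.py {paladin|bwa} <alignments.sam>",
              file=sys.stderr)
        return None
    return argv[0].lower(), argv[1]

if __name__ == "__main__":
    parse_args(sys.argv[1:])
```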

Toni Westbrook
@twestbrookunh
Hi Jordan - thanks for tackling this. So, I'm running it on a few SAM files right now (both BWA and PALADIN), but as a test, I ran it on one I already had similarity scores for, just to double-check that I was still getting the same score with the new script. However, I'm getting slightly (within 1%) different scores with the new script than the old - for example, before, on a paladin SAM run against swissprot, I'd get 0.912, 0.878, and 0.904 for the 3 ontology type similarities. With the new script, I'm getting 0.907, 0.875, and 0.902. Again, the differences are tiny, but I just wanted to double check whether you made a known change that would account for the slight calculation difference, or if it was unknown and something we need to investigate. Thanks again -
Jordan Ramsdell
@Vadavim
Hi Anthony, thanks for pointing that out. I'll look into my changes to see if something might have changed how I calculated it in the end. Have you tested it on other ones to see if they're also within 1%-ish?
Matt MacManes
@macmanes
hey @twestbrookunh how much longer do you think that Paladin run has on Pinky? I need to use it for the next few days. Can I use 16 threads and you use the other 16? Don’t kill the existing run.
Toni Westbrook
@twestbrookunh
Sorry, forgot to respond back earlier - my big paladin run (like you probably saw) finished up on Pinky - if I do any more I'll be sure to confine it to 16 threads or less
Toni Westbrook
@twestbrookunh
Just wanted to make sure someone knew that the root partition on Brain is filled up - bash autocomplete and other stuff is broken because of it.
Toni Westbrook
@twestbrookunh
Jordan - just a heads up, I was starting to rerun some of the tests using the new mined swissprot you put in my incoming directory - it did worse with PALADIN than the previous mined swissprot (i.e. the filtered version that had the 6 genomes removed) - it went down from 29.34% to 25.78%. After checking, there are a lot fewer entries in the file too; the last one had 528,000, this one has 373,000. Let me know if I'm misunderstanding something though. I'm going to hold off on retesting the swissprot portions until I confirm with you. Thanks
Toni Westbrook
@twestbrookunh
Also - a quick update on Novoalign. I did a lot of tests crafting many permutations of a 10-nucleotide string (anything smaller than that, and while it does index properly, Novo will always fail with a read-quality error during alignment). While there doesn't appear to be an easy rule (e.g. "must have X non-ambiguous nucleotides to seed properly"), there is definitely a relation between how many and what types of ambiguous IUPAC codes are in the string and whether it will seed/align. The degree of ambiguity, whether the ambiguities are contiguous, and how many there are all seem to have an effect. As one example, if you have two contiguous 3-way (e.g. V, H, D, or B) and/or 4-way (N) ambiguous IUPAC codes, it will fail, while replacing one of them with a 2-way code (e.g. M, R, W, S) will succeed. Having them non-contiguous will succeed, unless there are too many, which then fails, etc. So it's not as simple as just a seed length, but it definitely suggests some type of statistical analysis of exactly how ambiguous the seed is. That would make it difficult to overcome without knowing the exact underlying algorithm, but at least we can outline this in the paper and use the non-coding improvement example as proof that we understand the issue and what the theoretical fix would be. Let me know if anyone wants me to look at anything else Novo-related too
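The contiguous-ambiguity observation above can be sketched as a rough predictive heuristic. To be clear, this is purely a guess at the pattern from the tests described, not Novoalign's actual algorithm; the function name and the degree cutoff are hypothetical:

```python
# Ambiguity degree: how many nucleotides each IUPAC code can match.
IUPAC_DEGREE = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "M": 2, "R": 2, "W": 2, "S": 2, "Y": 2, "K": 2,
    "V": 3, "H": 3, "D": 3, "B": 3,
    "N": 4,
}

def likely_seeds(seq):
    """Predict whether a read would seed, using one observed failure mode:
    two contiguous codes that are each 3-way or 4-way ambiguous."""
    degrees = [IUPAC_DEGREE[base] for base in seq.upper()]
    return not any(a >= 3 and b >= 3 for a, b in zip(degrees, degrees[1:]))
```

By this rule, a 10-mer containing "NV" fails while "NM" passes, matching the 3-way/4-way vs. 2-way substitution example above.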
Toni Westbrook
@twestbrookunh
One more note - good news - I filtered that CF lung read set for 250BP reads, and it aligned at 92.63%, up from 86.29%. And for reference, that's out of 1,152,265 detected ORFs from the reads, so this is in the million+ range of CDS. As a quick check, I also did a count/sort on the proteins returned - a lot of Streptococcus and other bacteria that (according to the literature) can be prevalent in CF cases - so looking good!
Jordan Ramsdell
@Vadavim

So blasting with 28 threads took around 8 hours and 30 minutes for 8000 sequences. If we look at those that exist in both the blast and in paladin's mapping, then there are a total of ~6400 comparable sequences.

Of those, 5047 matched perfectly between paladin and blast. Of the 1396 that did not match perfectly, 189 had identical go terms, and the average similarity score was 0.66

Toni Westbrook
@twestbrookunh
Awesome - so that means (assuming linear scaling @ 28 cores) for BLAST, we're looking at 8000seq/510min = 15.686 sequences per minute? So, the full set that took 31 hours for PALADIN was 238,528,490 ORFs, so that many sequences for BLAST would be 15,206,191 minutes, or 28.93 years. (is that math right?) So - that's a pretty good improvement! Also, Jordan, did you see my Uniprot question above?
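As a sanity check on the arithmetic above (assuming linear scaling with cores, as the message does):

```python
blast_seqs = 8000
blast_minutes = 8 * 60 + 30              # 8.5 hours on 28 threads
rate = blast_seqs / blast_minutes        # sequences per minute

total_orfs = 238_528_490                 # the full set PALADIN finished in 31 hours
est_minutes = total_orfs / rate
est_years = est_minutes / (60 * 24 * 365)

print(round(rate, 3), round(est_years, 2))  # 15.686 28.93
```

So the math checks out: roughly 15.2 million minutes, about 29 years of BLAST time for the set PALADIN handled in 31 hours.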
Jordan Ramsdell
@Vadavim
Hi Anthony, sorry about the late reply. Today was crazy! Yes, so basically I was stricter this time in terms of what I included as CDS, which is why I wanted to double check with you before rerunning paladin. I only chose annotated genomic CDS (not mRNA). If I expand it to mRNA, I can probably bring those numbers up. They'll still be slightly lower in the end because I didn't also try to include RefSeq versions of the CDS when they couldn't be found on EBI (it's more trouble than it's worth). So if you'd like, I can rerun the swissprot download script and include mRNA
Toni Westbrook
@twestbrookunh
Let's double check with the rest of the group on Tuesday to see which way we want to go. Also, just a heads up for everyone, I just posted the version 1.1.0 branch on Github. This version does version checking (it will prevent incompatible old index versions and give a warning for future index versions), has a built-in option for preparing the UniRef90, and also lets you use a pre-downloaded and/or pre-indexed reference for the "prepare" command. That way, if someone spent all the time indexing the UniRef90 without having the "prepare" command do it, they can run "prepare" afterward, and it will fix up both the reference and the ANN file (from the index) to use the representative ID for the sequence header name, without having to reindex. Also, if the user requests a UniProt report, this version checks that the reference was prepared as a UniProt reference (which is why I included all that functionality - so a user can't get into a situation where they accidentally have to index something twice). I'm going to keep it as a branch until I test a bit more, then will merge. Thanks
Jordan Ramsdell
@Vadavim
Sounds good!
Toni Westbrook
@twestbrookunh
Just a heads up, I have some interesting data for today's meeting - found out new details when comparing BWA to UniRef90. Taruna, I will be sending over the Excel spreadsheet in the next couple hours so you have it - it has all the data for the paper. Thanks!
Toni Westbrook
@twestbrookunh
As a follow up to loading the index from shared memory - I did some testing and looked at the code tonight. The bad news is it crashes when used, but the good news is it gets far enough in that I can tell that, once it works, it would drastically speed up the alignment process for smaller read sets (especially batches of smaller read sets - you could keep the index loaded in memory for as long as the machine stayed online). It loads the index almost instantly, as opposed to taking 10 minutes or so. After looking at the source, I think I can fix why it's crashing - there are some hardcoded values related to 32-bit operation and some other things I'd need to bring up to 64-bit. Hopefully I can give it a try this weekend and incorporate the fixed version into the 1.1.0 branch
Toni Westbrook
@twestbrookunh
If someone has a moment before the meeting today, on brain, can someone run: "/home/genome/anthony/repos/paladin/paladin shm -l" and see if the first line it returns is "uniref90_filtered_aa.fasta.gz"? I want to double check it works from a different account. Thanks -
(that "-l" up there is a dash L, not a dash one, btw)
Jordan Ramsdell
@Vadavim
Trying it now

It returns:

uniref90_filtered_aa.fasta.gz 53198279679
[main] Version: 1.1.1
[main] CMD: /home/genome/anthony/repos/paladin/paladin shm -l
[main] Real time: 0.000 sec; CPU: 0.006 sec

So yes.

Jordan Ramsdell
@Vadavim

Also, some updated stats about the blast hits vs. paladin's mapping with the 8000 random read dataset:

Total unique blast hits: 7894
Total reads that had hits in paladin: 6435

Paladin did not have any hits unique to it, so there were 1459 reads that had a blast hit but did not have a mapping from paladin.

Toni Westbrook
@twestbrookunh
Hey Jordan - just a heads up, if you do want to double check the test set against the uniref90 filtered so it's completely apples to apples, both the normal and the T10 run are in the directory with the rest of the tests (directory name paladin-test-uniref90-filtered).
Matt MacManes
@macmanes
hey @twestbrookunh: The install test is failing.
make
cd sample_data && \
    sh ./make_test.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 5473k  100 5473k    0     0  20.5M      0 --:--:-- --:--:-- --:--:-- 20.5M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  7077  100  7077    0     0   138k      0 --:--:-- --:--:-- --:--:--  141k
[M::command_index] Translating protein sequence...0.00 sec
[M::command_index] Packing protein sequence... 0.00 sec
[M::command_index] Constructing BWT for the packed sequence... 0.00 sec
[M::command_index] Updating BWT... 0.00 sec
[M::command_index] Packing forward-only protein squence... 0.00 sec
[M::command_index] Constructing suffix array... 0.00 sec
[main] Version: 1.1.0
[main] CMD: ../paladin index -r2 paladin_test.faa
[main] Real time: 0.024 sec; CPU: 0.007 sec
[M::command_align] Loading the index for reference 'paladin_test.faa'...
[M::index_load_from_disk] Read 0 ALT contigs
[E::command_align] Reporting can only be used on prepared indices.


OOPS: SOMETHING WENT WRONG WITH INSTALLATION, OR YOU ARE NOT CONNECTED TO THE INTERNET
I assume this is because the usage has changed, rather than an installation issue
Toni Westbrook
@twestbrookunh
This should be good to go now - this was because you can no longer request a report (without a manual prepare) on a non-uniprot reference. However, I updated the script so it runs a manual prepare on that test reference, specifying it as a SwissProt-style reference.
Toni Westbrook
@twestbrookunh
Heads up, I looked for Kelley's paragraph on the speed vs BLAST, but I couldn't find it anywhere (I feel like maybe he showed it on his laptop on the projector or printed it out or something - I don't think a distributed copy of it exists), so I did a new one for the paper. Feel free to make any edits (I had one comment in there too about it). Also, I did a few cleanups on a couple of sentences that got chopped up a bit during all of our edits, and added a few more sentences to the supplemental. I think this is our todo list, feel free to mention anything I missed:
  1. Conclusion
  2. Caption on table 1
  3. Completing our references
  4. Acknowledgements
  5. Final spelling/grammar/spacing/tense check
Toni Westbrook
@twestbrookunh
I cleaned up the table a little bit on the row headers so they weren't two lines, and cleaned up the vertical alignment so things matched
Toni Westbrook
@twestbrookunh
@macmanes - it looks like the github URL got removed from the abstract sometime between the 12/23 version of the document and the 1/10 version through everyone editing it. I think that was a mistake (and I'll add it back in), but I wanted to check with you if you remember having a conversation with anyone in our meetings that this was done on purpose? I don't remember that, but I wanted to double check
Toni Westbrook
@twestbrookunh
Oh, I think it was taken out because it was over the 70 word limit. However, with all the edits (regardless if we use the original or Dan's version), we can put it back in and be under 70. I'll do it for now, it would be nice to have the URL easily accessible so people aren't hunting for it
Jordan Ramsdell
@Vadavim
So should I hold off on editing for now?
Matt MacManes
@macmanes
did @twestbrookunh give you the OK last night?
Jordan Ramsdell
@Vadavim
Oh, hmm, I guess not. I was just worried I had missed something.
Toni Westbrook
@twestbrookunh
Jordan, you should be good to go. If you copy the current manuscript into the old_stuff folder with the same naming format (Manuscript-date-revision), then you can edit the current file
Jordan Ramsdell
@Vadavim
Quick question... are we supposed to mention what PALADIN stands for in the manuscript? I can't seem to see it mentioned there or in the methods, although I may have missed it.
Matt MacManes
@macmanes
isn't it in the title?
Fredrik Boulund
@boulund
Hey, is there anyone monitoring this room? I've a question regarding the output of PALADIN on a gut metagenome I just briefly tested it on.
Matt MacManes
@macmanes
Whoops, yes @boulund, tho obviously not very well. Apologies for that. Can @twestbrookunh or I still help you?
Fredrik Boulund
@boulund
I don't know. Maybe :)
It looks like the output contains a lot of eukaryotic hits, which seems a bit strange for a gut metagenome. No other methods identify a majority of Drosophila, slime molds, worms, frogs, and some plants... Have you seen this before?
Toni Westbrook
@twestbrookunh
Hi @boulund - sorry about that, I haven't been good about monitoring this room either. For your issues with strange hits, are you making use of the Uniref90 for your reference, or a different database?
Fredrik Boulund
@boulund
Hi again, @twestbrookunh @macmanes, half a year later... Sorry about that :)
Turns out I messed up when constructing the database, so I only downloaded the bacterial references, but forgot to actually include them when building the database. Whoops :D. Problem solved :+1: