So everything should work fine for the Uniref90 SAM files. I had to tweak listMapped because the indexes were a little off after splitting. I ran it on align.sam and it seemed to go okay. The script is in /home/genome/airjordan/bin/python_libs/similarity_tester.py.
Ahhh, as for SwissProt... I think the script could handle that before I changed it, but I had to change the way I parse the lines for the Uniref samples (and made some tweaks in listMapped), so I don't think it'll work on the older data anymore.
I can make a separate version that can work for those. Do you have the filepath for a sam file I can test on?
So I think the similarity_tester script should be all set up now to work with Uniref/SwissProt and BWA/Paladin. I've tested it with BWA+SwissProt, Paladin+SwissProt, and Paladin+Uniref90. Is there a BWA+Uniref90 sample somewhere? I haven't been able to test that combination, but it should work fine.
The script now takes two arguments. The first is the type of mapper ("paladin" or "bwa") and the second is the sam file to parse through. I've also added usage information just in case.
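For anyone who hasn't opened the script yet, here is a rough sketch of what a two-argument interface like that could look like. This is not copied from similarity_tester.py; the names `parse_cli`, `mapper`, and `sam_file` are made up for illustration, and the real script may handle its arguments differently.

```python
import argparse


def parse_cli(argv=None):
    """Parse a mapper type and a SAM file path (hypothetical interface sketch)."""
    parser = argparse.ArgumentParser(
        prog="similarity_tester.py",
        description="Sketch only: mirrors the two arguments described above.",
    )
    # First positional argument: which mapper produced the SAM file.
    parser.add_argument(
        "mapper",
        choices=["paladin", "bwa"],
        help="mapper that produced the SAM file",
    )
    # Second positional argument: the SAM file to parse.
    parser.add_argument("sam_file", help="path to the SAM file to parse")
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)
```

Calling `parse_cli(["paladin", "align.sam"])` returns a namespace with `mapper` and `sam_file` set. Using `choices` means argparse prints the usage text and exits by itself if the mapper is anything other than "paladin" or "bwa".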
Anywho, you can still find it in the same spot: /home/genome/airjordan/bin/python_libs/similarity_tester.py
So blasting with 28 threads took around 8 hours and 30 minutes for 8000 sequences. If we look only at the reads that have both a BLAST hit and a Paladin mapping, there are ~6400 comparable sequences in total.
Of those, 5047 matched perfectly between Paladin and BLAST. Of the 1396 that did not match perfectly, 189 had identical GO terms, and the average similarity score was 0.66.
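To put those counts in proportion, here is a quick bit of arithmetic on the numbers above. The function name `summarize_matches` is just for illustration, and the 0.66 average is not recomputed since it depends on the per-read scores.

```python
def summarize_matches(perfect, imperfect, same_go):
    """Turn the raw match counts into totals and fractions."""
    comparable = perfect + imperfect
    return {
        "comparable": comparable,
        # Share of comparable reads where both tools agreed exactly.
        "perfect_fraction": perfect / comparable,
        # Share of the imperfect matches that still had identical GO terms.
        "same_go_fraction": same_go / imperfect,
    }


stats = summarize_matches(perfect=5047, imperfect=1396, same_go=189)
print(stats["comparable"])                  # 6443
print(round(stats["perfect_fraction"], 3))  # 0.783
print(round(stats["same_go_fraction"], 3))  # 0.135
```

So roughly 78% of the comparable reads matched perfectly, and about 14% of the remaining ones still agreed on GO terms.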
[main] Version: 1.1.1
[main] CMD: /home/genome/anthony/repos/paladin/paladin shm -l
[main] Real time: 0.000 sec; CPU: 0.006 sec
Also, some updated stats about the blast hits vs. paladin's mapping with the 8000 random read dataset:
Total unique blast hits: 7894
Total reads that had hits in paladin: 6435
Paladin did not have any hits unique to it, so there were 1459 reads that had a blast hit but did not have a mapping from paladin.
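In case it helps to reproduce that breakdown, this is roughly how the comparison can be done with sets of read IDs. It is a sketch with toy IDs, not the code I actually ran; `compare_hits` and the example read names are hypothetical.

```python
def compare_hits(blast_ids, paladin_ids):
    """Count reads hit by both tools and by each tool alone."""
    blast = set(blast_ids)
    paladin = set(paladin_ids)
    return {
        "both": len(blast & paladin),
        "blast_only": len(blast - paladin),
        "paladin_only": len(paladin - blast),
    }


# Toy example: every Paladin read also has a BLAST hit.
toy = compare_hits(["r1", "r2", "r3", "r4"], ["r1", "r3"])
print(toy)  # {'both': 2, 'blast_only': 2, 'paladin_only': 0}

# With no Paladin-only reads, the BLAST-only count is a plain difference.
print(7894 - 6435)  # 1459
```

When `paladin_only` is 0, as it was here, the Paladin set is a subset of the BLAST set, which is why the 1459 figure falls straight out of 7894 - 6435.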
make
cd sample_data && \
sh ./make_test.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 5473k  100 5473k    0     0  20.5M      0 --:--:-- --:--:-- --:--:-- 20.5M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  7077  100  7077    0     0   138k      0 --:--:-- --:--:-- --:--:--  141k
[M::command_index] Translating protein sequence...0.00 sec
[M::command_index] Packing protein sequence... 0.00 sec
[M::command_index] Constructing BWT for the packed sequence... 0.00 sec
[M::command_index] Updating BWT... 0.00 sec
[M::command_index] Packing forward-only protein squence... 0.00 sec
[M::command_index] Constructing suffix array... 0.00 sec
[main] Version: 1.1.0
[main] CMD: ../paladin index -r2 paladin_test.faa
[main] Real time: 0.024 sec; CPU: 0.007 sec
[M::command_align] Loading the index for reference 'paladin_test.faa'...
[M::index_load_from_disk] Read 0 ALT contigs
[E::command_align] Reporting can only be used on prepared indices.
OOPS: SOMETHING WENT WRONG WITH INSTALLATION, OR YOU ARE NOT CONNECTED TO THE INTERNET