These are chat archives for dereneaton/ipyrad

5th
Mar 2016
Isaac Overcast
@isaacovercast
Mar 05 2016 01:04 UTC
Is step6 working for you?
  File "/usr/local/opt/anaconda/lib/python2.7/site-packages/ipyrad-0.1.71-py2.7.egg/ipyrad/assemble/cluster_across.py", line 427, in singlecat       
    icatg[iloc] = catarr[iloc, :icatg[iloc].shape[1], :] 
    raise TypeError("Can't broadcast %s -> %s" % (target_shape, count))                                                                              
TypeError: Can't broadcast (4, 4) -> (1, 150, 4)
Found the bug in step 3 for reference mapping. I'm cleaning it up and verifying it's fixed.
Deren Eaton
@dereneaton
Mar 05 2016 01:06 UTC
It will when I push my fixes. I broke it yesterday
Isaac Overcast
@isaacovercast
Mar 05 2016 01:19 UTC
Is the workshop tomorrow?
Isaac Overcast
@isaacovercast
Mar 05 2016 01:28 UTC
OK. There were two problems that were causing the high indels in refmapping. Problem 1) I was fsck'ing up the writing to clust.gz so the last unmapped cluster and the first mapped cluster were not separated by //. Easy fix.
2) The other thing that's causing high indels is an artifact of the hackish nature of the simulated mitogenome we're using. I generated this by randomly grabbing reads from the sim_rad data and randomly sticking them into the mt genome from the mouse. Good enough for government work, but randomly some of the simulated fragments overlap enough so that bedtools merges them, in which case they are joined as one cluster and then muscle tries to align them with obvious consequences.
Isaac Overcast
@isaacovercast
Mar 05 2016 01:34 UTC
For the sim data this affects 3 clusters total, i don't think there's anything we can do about it unless we simulate a better mitogenome. I'll push the fix for the first problem momentarily.
Deren Eaton
@dereneaton
Mar 05 2016 01:40 UTC
OK. Thanks. Sounds good. Well have to start testing on real data more to make sure it's working. It's good that the filter is excluding the bad clusters.
Isaac Overcast
@isaacovercast
Mar 05 2016 01:44 UTC
Yah, once i get glenn's gbs data run all the way through denovo i'll start testing refmapping. should be interesting. you in idayho?
Deren Eaton
@dereneaton
Mar 05 2016 01:59 UTC
Yep
Ipyrad workshop is Sunday
So theres a little time left for testing
But ask the sim datasets should be working now
And I'll do preview mode on an empirical dataset
Deren Eaton
@dereneaton
Mar 05 2016 13:55 UTC
I'm testing with the sim_rad_test refmap data set now.
It seems to be working better in so far as the alignment bug is fixed. But why do the samples recover different numbers of refmatched reads? It seems they should all match to the same regions of the reference, but I'm getting between 50 and 80 clusters recovered among the different samples.
Deren Eaton
@dereneaton
Mar 05 2016 14:06 UTC
Is this a problem with how smalt identifies homology, or with our treatment of the mapping results?
Deren Eaton
@dereneaton
Mar 05 2016 14:17 UTC
I think the problem has to do with refmapping sticking a bunch of inserts or Ns in the reads somewhere. This is what I see in the .loci file from a finished refmap run
1C_0     TGTTTCCTAGACGGGAGATCATC-----TGAAATTCGGTT-TTATCCTTGACCTGGTGCCACCGCCTCAATCTCTTGATTTCAGACCAAAAGAA
1D_0     TGTTTCCTAGACGGGAGTTCATCNNNNNTGAAATTCGGTTNTTATCCTTGACCTGGTGCCACCGCCTCAATCTCTTGATTTCAGACCAAAAGAA
2E_0     GGTTTCCTAGACGGGAGTTCATC----NTGAAATTCGGTTNTTATCCTTGACCTGGTGCCACCGCCTCAATCTCTTGATTTCAGACCAAAAGAA
2G_0     TGTTTCCTAGACGGGAGTTCATCNNNNNTGAAATTCGGTTNTTATCCTTGACCTGGTGCCACCGCCTCAATCTCTTGATTTCAGACCAAAAGAA
3I_0     TGTTTCCAAGACGGGAGTTCATCNNNNNTGAAATTCGGTTNTTATCCTTGACCTGGTGCCACCGCCTCAATCTCTTGATTTCAGACCAAAAGAA
//       -      -         -                                                                            |
1B_0     TCGTGATGGTGCTGACTCAATTGGTACTCTGGCACGAT-----TATGCCGCATAGCACCACCCGTGCCGGATTGGCACCGNNNNNNNNNACAAGTGTTAGGC
1C_0     TCGTGATGGTGCTGACTCAATTGGTACTCTGGCACGAT-----TATGCCGCATAGCACCACCCGTGCCGGATTGGCACCGNNNNNNNNNACAAGTGTTAGGC
2G_0     TCGTGATGGTGCTGACTCAATTGGTACTCTGGCACGAT-----TATGCCGCATAGCACCACCCGTGCCGGATTGGCACCGNNNNNNNNNACAAGTGTTAGGC
3I_0     TCGTGATGGTGCTGACTCAATTGGTACTCTGGCACGATNNNNNTATGCCGCATAGCACCACCCGTGCCGGATTGGCACCG-----NNNNACAAGTGTTAGGC
//                                                                                                             |
ugh, something is going wrong here. Reads with than many Ns should have been removed in step5 anyway...
Deren Eaton
@dereneaton
Mar 05 2016 14:50 UTC
I'm trying to get a sense of the refmapping code. Do you refmap the reads before we dereplicate them?
Deren Eaton
@dereneaton
Mar 05 2016 15:27 UTC
I think I'll have to skip doing refseq for the workshop. It seems close, but there's definitely some major bugs to work out. I should have tested with it earlier. We should plan a meetup some time soon so we can really dive into the code together.
Isaac Overcast
@isaacovercast
Mar 05 2016 16:43 UTC
What clustering threshold are you using?
Deren Eaton
@dereneaton
Mar 05 2016 16:44 UTC
That was 0.85
Isaac Overcast
@isaacovercast
Mar 05 2016 16:44 UTC
Refmap happens before dereplication, but then it dereplicates the mapped reads itself.
Deren Eaton
@dereneaton
Mar 05 2016 16:44 UTC
After I had hacked in it a little bit
What was the idea behind doing it after vs before?
Isaac Overcast
@isaacovercast
Mar 05 2016 16:45 UTC
Increasing clustering threshold solves some of the problem (at 0.90 you only see very few that are fsck).
Right now we use clustering threshold for both vsearch clustering and refmap sequence similarity. It's a simplifying shortcut, but not sure if it's justifiable. We might need to add a parameter for refmap sequence similarity if it turns out they need to be tuned independently. Needs more testing.
I guess the idea of doing refmap before dereplication was about potentially using quality scores during the mapping process.
It also simplifies the code a bunch. Dereplication happens inside clustall() after merging pairs, so we also want to map paired reads independently.
Deren Eaton
@dereneaton
Mar 05 2016 17:24 UTC
Oh, does it use qual scores? If so, that makes sense.
Deren Eaton
@dereneaton
Mar 05 2016 17:34 UTC
Note from Julian's talk: random oligos to deal with pcr duplicates.
Isaac Overcast
@isaacovercast
Mar 05 2016 17:35 UTC
At clustering threshold of .9 all the reads that show this bad behavior are a result of simulated reads overlapping regions. I'm investigating lower clustering thresholds (#126). In practice this shouldn't be a major issue. Also, i'm not sure if i'd expect all samples to recover the same number of refmapped sequences (because of the way the sim genome was constructed (by randomly selecting reads from simrad.gz)). The simulated mt genome isn't necessarily going to do anything very interesting, it's more a proof of concept.
It doesn't use qualscores now, but i wanted to have the option for the future (V2.0), more importantly is the unmerged pairs.
Isaac Overcast
@isaacovercast
Mar 05 2016 18:46 UTC
I created a new sim_mt_genome that isn't so much of a shit show. It has 24 sequences inserted from sim_rad_test_r1_ so that none are overlapping, no high indels errors even at low clustering thresholds.
I also changed the default smalt wordlen (in hackersonly) so it works better on the sim data. Will have to experiment on real data to get a better sense of default, but my experience is 13-16 is best.
Isaac Overcast
@isaacovercast
Mar 05 2016 19:22 UTC
oh hell yeah, the implemented the new solver in conda (>4.0), it's really fast.
Deren Eaton
@dereneaton
Mar 05 2016 21:42 UTC
hells yeah. Sounds awesome. I'll play around with the new stuff while I watch Julian's presentation.
Deren Eaton
@dereneaton
Mar 05 2016 21:56 UTC
did you push that yet? I see smalt_wordlen=8 in my version
Isaac Overcast
@isaacovercast
Mar 05 2016 21:57 UTC
8 is the optimal value for the simulated data, so you've got the newest ver
Deren Eaton
@dereneaton
Mar 05 2016 22:04 UTC
I'm pushing 0.1.73 now with fix for step6
Isaac Overcast
@isaacovercast
Mar 05 2016 22:05 UTC
cool thx
Deren Eaton
@dereneaton
Mar 05 2016 22:06 UTC
I might need you to push a mac conda version before tomorrow of whatever version we end settled on by the end of today.
Isaac Overcast
@isaacovercast
Mar 05 2016 22:12 UTC
no problem.
Deren Eaton
@dereneaton
Mar 05 2016 22:13 UTC
I made some minor changes to refmap, before you made fixes. didn't find any conflicts on pull, but maybe you should check that I didn't mess anything up.
Isaac Overcast
@isaacovercast
Mar 05 2016 22:33 UTC
looks ok.
Isaac Overcast
@isaacovercast
Mar 05 2016 23:12 UTC
I can calculate inner mate distance for mapped pairs, rather than hard coding it (dumb). On real data do you think it's better to use max inner mate distance or mean +2 standard deviations? I'm inclined toward the latter.
Deren Eaton
@dereneaton
Mar 05 2016 23:29 UTC
is using the max slower?
Isaac Overcast
@isaacovercast
Mar 05 2016 23:34 UTC
Not slower, just the max insert size could be a significant outlier