These are chat archives for dereneaton/ipyrad

13th
Dec 2015
Deren Eaton
@dereneaton
Dec 13 2015 01:20
Hmm, I know there are implicit ways to make sure that modules are imported on clients, and there is a decorator @require that might help, these are explained a bit in the ipyparallel docs, but I haven't found that I've had to use them at all yet. I'm not sure why it wouldn't work if its following the multi_muscle_align...
Isaac Overcast
@isaacovercast
Dec 13 2015 17:37
oof, nasty bug. I had to cycle ipcluster, think it had an older version of the module loaded and i couldn't figure out how to get it to nicely reload. I saw some different methods for loading libs in the ipp docs but none of them worked, had to do it "the old fashioned way". I think herding this ipp cat is going to be the biggest challenge, but it's def worth it, it works good, it's just picky.
Deren Eaton
@dereneaton
Dec 13 2015 20:18
Alright, I'm finally settled on the fact that we are going to use h5py. I'm using it now in step6 to build a giant array with all of the depth info for all consensus reads that clustered together. This isn't sooo hard, since I saved the depth information from each individual sample in step3 as an array, so its just taking the arrays for each sample, and the hitsmap from vsearch of which names go together in a locus, and filling in the matrix. Easily parallelizable too. And similarly easy to put refmapped matches in here. The pain is that I need to pull all of the indels out of the muscle aligned files and insert them back into this giant matrix. I was working with chunking up arrays to avoid memory limits and finally decided that this is really what h5py and dask are for. So I'm going for it, and trying to use these. I think I'm getting it. It's going to be pretty bad ass in the end. If it works well we might want to go back eventually and use a similar framework for steps 1,2 and 5. We'll see.
And dask is totally integrated with ipyparallel, just found this:
from dask.distributed import dask_client_from_ipclient
dclient = dask_client_from_ipclient(ipclient)
Isaac Overcast
@isaacovercast
Dec 13 2015 21:29
Sounds good. I'll have to go read more about this. I haven't looked up dask before and from the name i can't even begin to guess what it does. Mystery!
Question...
Reference mapping can do one of two things, it can hard mask or soft mask the alignment. Hard mask means it returns ONLY the bases that successfully align. Soft mask means it returns the aligned bases and the unaligned flanking regions.
Deren Eaton
@dereneaton
Dec 13 2015 21:32
that reminds me of an important question I've been meaning to ask: how does one toggle the threshold for successful mapping? Does the clust_threshold parameter work the same way?
I feel like soft-masking sounds better, but I don't have a good idea for what that would look like on real data, and if the flanking regions could potentially be really divergent
Isaac Overcast
@isaacovercast
Dec 13 2015 21:34
(I'm currently deciding on that, we could use the clust_threshold value, it is very similar for the aligner % sequence identity)... Now, the pileup naturally only understands what to do with successfully aligned bases so it's just throwing away the unaligned flanking regions regardless of hard or soft masking, so here's my question, since we aren't (at this point) using the pileup to call snps, and we might throw mapped reads into the pipeline before muscle, would it be better just grab all raw, unmasked reads in a stack and skip the pileup process?
Right, soft-masking sounds better to me, because you aren't throwing away data. Plus if we did it this way it'd route around a bunch of nonsense I have to do to decompile the fasta from the pileup format (even tho i fuckin worked on that shit for days! )
Deren Eaton
@dereneaton
Dec 13 2015 21:38
lol.
I don't have a very good sense of what each program is doing
without pileup could you still get stacks that are not perfectly overlapping?
in terms of pulling out raw unmasked reads that mapped to the same region
Isaac Overcast
@isaacovercast
Dec 13 2015 21:42
Getting stacks of perfectly overlapping reads is bedtools merge.
I can use the results of merge to pull out all unmasked reads that overlap with each region it specifies.
to make the stacks
Deren Eaton
@dereneaton
Dec 13 2015 21:43
that would be good enough for rad and ddrad, gbs reverse matches will have a slight stagger
This message was deleted
oh well, that won't work
Isaac Overcast
@isaacovercast
Dec 13 2015 21:45
I saw what you were trying to do.
Deren Eaton
@dereneaton
Dec 13 2015 21:45
haha
the cut sites won't overlap
the read will
we could probably deal with that though
Isaac Overcast
@isaacovercast
Dec 13 2015 21:46
right, which is fine, but are they always equal length?
Deren Eaton
@dereneaton
Dec 13 2015 21:46
yeah, step2 trims R2 to be the same length as R1
it is possible that some reads could get trimmed shorter though
Isaac Overcast
@isaacovercast
Dec 13 2015 21:47
just for gbs or for ddrad as well?
Deren Eaton
@dereneaton
Dec 13 2015 21:48
just on gbs, since the ddrads won't reverse match
... I think, I'll check
Isaac Overcast
@isaacovercast
Dec 13 2015 21:50
Ok, good, well given what we've talked about i think i'm gonna go re-engineer the refmap.py, route around the pileup business (it was a snakepit anyway). Sounds like all the potential problems can be dealt with..
Deren Eaton
@dereneaton
Dec 13 2015 21:51
sounds good. I've got some messy data pairgbs and pairddrad data we can test later that should reveal any problems with variable length reads.
Isaac Overcast
@isaacovercast
Dec 13 2015 21:52
Also, i fabricated a small pseudo-mitogenome for testing the rad-sim data, just took the mouse mt genome, randomly grabbed 100 reads from the test_rad_sim.fq.gz and crammed them in there at random positions. I tried it and it seemed to work ok, think that's a workable strategy?
Deren Eaton
@dereneaton
Dec 13 2015 21:53
brilliant
Isaac Overcast
@isaacovercast
Dec 13 2015 21:53
:sunglasses:
Alright, i'm gonna get back to hackin on it, shouldn't be too tricky to rearrange. Hope alls good...
Deren Eaton
@dereneaton
Dec 13 2015 21:57
you too. I was gonna take a break and work on other things, but I got a little too excited about finishing step6. Boss ain't gonna be happy. I'll hopefully finish it tonight and then probably step away for a little bit. You can maybe hack away at step7 next week, which is pretty much just applying filters and formatting.
I moved muscle alignment of clusters from step7 to step6. I think step6 will finish with a vcf output that can be used to apply filters to and to create all other output files from. At least that's what I'm thinking at the moment. It should be a lot more efficient that iterating over .loci files.
Isaac Overcast
@isaacovercast
Dec 13 2015 22:12
Yeah, perfect. vcf will be good for summary stats too, thats a convenient input format. Alright man talk to you soon..
Deren Eaton
@dereneaton
Dec 13 2015 22:12
later
Isaac Overcast
@isaacovercast
Dec 13 2015 22:50
Ungh! Re-architected the whole thing, like we talked about. Routed around pileup, pulled in unmasked reads from sam, dropped the clusters back in before muscle. Compile. Execute. Works on the first try. :sunglasses:
Isaac Overcast
@isaacovercast
Dec 13 2015 23:16
                 state  reads_raw  reads_filtered  refseq_mapped_reads refseq_unmapped_reads  clusters_total  clusters_kept
       1A_0      3      20099           20099                 1594            18505                        1007           1002
Works good for rad-seq sim data. I'm psyched.