These are chat archives for dereneaton/ipyrad

4th
Jan 2016
Deren Eaton
@dereneaton
Jan 04 2016 17:48
OK, I'm back and taking a look at things. Had to replace my laptop HD, and I couldn't connect to my computer at work over the break, so I pretty much did zero work. However, there were a bunch of changes on my workstation that I didn't push before leaving, including a reworking of the parameter order, and set_params, such that we can now reorder them easily without having to rewrite the numbering (params are all linked to their keywords and their index is their order in the paramdict OrderedDict). I also have some simplification of the IPcluster client launcher. I'll work on merging these with all of your changes.
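A minimal sketch of the keyword-linked parameter idea described above, assuming an `OrderedDict` whose insertion order doubles as the parameter index. The parameter names, values, and the `set_param` helper are hypothetical and only illustrate the approach; they are not the actual ipyrad code.

```python
from collections import OrderedDict

# Hypothetical parameter names and values, for illustration only.
paramsdict = OrderedDict([
    ("working_directory", "./"),
    ("mindepth", 6),
    ("clust_threshold", 0.85),
])


def set_param(params, key, value):
    """Set a parameter by keyword, or by its position in the current order."""
    if isinstance(key, int):
        # Resolve a positional index to its keyword using the dict order.
        key = list(params.keys())[key]
    if key not in params:
        raise KeyError("unknown parameter: {}".format(key))
    params[key] = value
    return key


set_param(paramsdict, "mindepth", 10)   # by keyword
set_param(paramsdict, 2, 0.90)          # by index, resolved to "clust_threshold"
```

Because the index is derived from the order rather than stored, reordering the entries changes the numbering without touching any other code.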
Isaac Overcast
@isaacovercast
Jan 04 2016 19:32
Sounds good. Where'd you go for break?
Isaac Overcast
@isaacovercast
Jan 04 2016 19:38
So I spent a little time figuring out step6 and looking at step7 in the pyrad codebase, mostly trying to figure out exactly what step7 needs to do in ipyrad. It looks like cluster_across is doing lots of the work of the old step7 (doing the alignment, etc.). What's your expectation for the eventual output of step6? (It looks like you intend to output vcf, at least from the stubbed code.) So for step 7, is your expectation that it's just going to be all about file conversion? What else did you have in mind for step 7?
Deren Eaton
@dereneaton
Jan 04 2016 19:45
Yeah, just file conversion after applying filters: max_shared_heterozygosity, minsamp, maxSNP, etc.
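A rough sketch of what two of these filters could look like, assuming a locus is a dict mapping sample name to an aligned sequence. The function names and the data layout are hypothetical, the SNP count ignores `N` and `-` and does not interpret ambiguity codes, and max_shared_heterozygosity is left out.

```python
def count_snps(seqs):
    """Count alignment columns with more than one distinct base, ignoring N and -."""
    nsnps = 0
    for column in zip(*seqs):
        bases = set(column) - set("N-")
        if len(bases) > 1:
            nsnps += 1
    return nsnps


def passes_filters(locus, minsamp, maxsnp):
    """locus: dict mapping sample name -> aligned sequence (equal lengths)."""
    if len(locus) < minsamp:
        # Too few samples share this locus.
        return False
    return count_snps(list(locus.values())) <= maxsnp


# Toy locus: columns 3 and 6 are variable.
locus = {"1A": "ACGTAC", "1B": "ACGTAT", "1C": "ACCTAC"}
keep = passes_filters(locus, minsamp=3, maxsnp=2)
```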
Deren Eaton
@dereneaton
Jan 04 2016 19:59
My plan was that we do the alignment to get the location of indels which we would then insert into the catg array. At that point the only thing that we are missing is the phase information within loci (which is only relevant to diploid data). That data is contained in the goofy lowercase lettering in the consens reads, but maybe it would be better to store it in an array during step5 as well. In other words, the consens reads would be written to always have a certain allele precedence (e.g., A/T -> A; A/C -> C; C/T -> C) but the ordering of alleles would be stored in an array.
[sample1][consread1][allele1][heterobase1] = "A"
[sample1][consread1][allele2][heterobase1] = "T"
...
[sample1][consread100][allele3][heterobase3] = "C"
The dims of the array for each sample would be [nconsensreads] x [max_alleles] x [maxHs] and anything that didn't fit within the dims would be discarded.
It could be a really large but very sparse array, however (like the indel array), in which case querying it would not be the most efficient thing to do...
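One way around the sparsity concern, sketched here as an alternative to a dense array: keep a per-sample dict keyed by (consens read, allele, heterozygous site), so only observed heterozygous bases take up space. The limits `MAX_ALLELES` and `MAX_HS` and the `store_phase` helper are hypothetical, and indices are 0-based.

```python
MAX_ALLELES = 2   # assumed limit (diploid); hypothetical value
MAX_HS = 3        # assumed limit on heterozygous sites per read; hypothetical value


def store_phase(phase, read, allele, site, base):
    """Record one base; return False if it falls outside the dims and is discarded."""
    if not (0 <= allele < MAX_ALLELES and 0 <= site < MAX_HS):
        return False
    phase[(read, allele, site)] = base
    return True


sample1 = {}
store_phase(sample1, 0, 0, 0, "A")   # consread1, allele1, heterobase1
store_phase(sample1, 0, 1, 0, "T")   # consread1, allele2, heterobase1
dropped = store_phase(sample1, 99, 2, 2, "C")   # third allele does not fit
```

Lookups are then `sample1.get((read, allele, site))`, which returns `None` for anything that was never stored.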
Deren Eaton
@dereneaton
Jan 04 2016 20:11
It would be easy to build a .loci file from the cat.clust.gz file at the end of step6 by recycling code from the old step7. But I think it makes more sense to do the alignment in step6 and have one master output file (VCF) from which all the others are made. Using the large catg arrays seems like the best way to get the depth information for each individual from step5 into aligned, filtered and ordered loci in step7.
For now let's move along without worrying about phase information, since we should be able to go back and incorporate that later.
Isaac Overcast
@isaacovercast
Jan 04 2016 20:19
Great. I'll start working on step 7 assuming it'll read in vcf and the super-catg. probably do filtering and write vcf to .loci, so then we can recycle all the old conversion code.
You still want to handle filtering outgroups and excludes in step7()?
Deren Eaton
@dereneaton
Jan 04 2016 20:37
hmm..., excluding taxa would be easy in the API by entering only the Samples that you want to include in the call to step7(). But for the CLI we might need to have an excludes line in the paramsfile... That's a pain.
Deren Eaton
@dereneaton
Jan 04 2016 20:43
The outgroup handling doesn't actually seem that useful in the end. This is most useful for step6 where outgroups can be pushed to the end of the concatenated consens file before clustering. It's a similar idea for hierarchical clustering (order consens reads so that close relatives are likely to be compared first, then more distant relatives). Both I feel are novelty features, and not all that useful/important in the end. I would say we add them later into step6 if we feel it's necessary.
Isaac Overcast
@isaacovercast
Jan 04 2016 20:45
excludes in the paramsfile seems like a small price to pay for backwards compatibility, i'll add it.
Deren Eaton
@dereneaton
Jan 04 2016 20:45
ok. cool.
The supercatg array is not being filled yet.
Isaac Overcast
@isaacovercast
Jan 04 2016 20:46
OK, i'll gin something up just for development.. hey, how were you thinking of preview() working?
Deren Eaton
@dereneaton
Jan 04 2016 20:46
I think it simply requires iterating over the individual catg arrays and indexing who grouped together in the clusters to fill it in. I'll work on that.
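A toy sketch of that fill step, under simplifying assumptions: each sample's catg data is reduced to one 4-count tuple per consens read (taken here to be depths in C, A, T, G order), and each cluster is a list of (sample, read index) members. The `fill_supercatg` function and the data layout are hypothetical.

```python
def fill_supercatg(catgs, clusters, sample_order):
    """Build a loci x samples table of depth counts.

    catgs: dict sample -> list of 4-count depth tuples, one per consens read
    clusters: list of clusters, each a list of (sample, read_index) members
    sample_order: list of sample names fixing the column order
    """
    column = {name: i for i, name in enumerate(sample_order)}
    empty = (0, 0, 0, 0)
    supercatg = []
    for members in clusters:
        # Samples absent from this cluster keep an all-zero entry.
        row = [empty] * len(sample_order)
        for sample, read_index in members:
            row[column[sample]] = catgs[sample][read_index]
        supercatg.append(row)
    return supercatg


catgs = {"1A": [(5, 0, 0, 0), (0, 7, 0, 0)], "1B": [(0, 6, 0, 0)]}
clusters = [[("1A", 1), ("1B", 0)], [("1A", 0)]]
supercatg = fill_supercatg(catgs, clusters, ["1A", "1B"])
```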
Isaac Overcast
@isaacovercast
Jan 04 2016 20:46
I'm still kludging it the old way in step3 cuz it's useful, but would love to have it working on the whole enchilada
ok
Deren Eaton
@dereneaton
Jan 04 2016 20:49
Oh yeah, I backed off the preview mode a little bit since I figured it would be easy to add in later (tho it's probably more difficult w/ regard to the refmapping). But I still like the idea. I think, like you said, we should have a way that subsamples the user's data. We might need to write a subsample routine into both steps 1 and 2, since users could start from either one.
Isaac Overcast
@isaacovercast
Jan 04 2016 20:52
Cool, i have a subsample routine already, so i can slot it in pretty easy. Do you like having 'preview' as an argument for each step? I kinda liked that, but if you have a better idea i'm happy to hear it.
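For illustration, a `preview` argument on a step might look something like the sketch below, assuming FASTQ input with 4 lines per read and a subsample that simply takes the first few reads. The `preview_reads` and `step` functions and the `preview_nreads` default are hypothetical and are not the existing subsample routine mentioned above.

```python
from itertools import islice


def preview_reads(lines, nreads):
    """Return the first `nreads` FASTQ records (4 lines each) from an iterable of lines."""
    return list(islice(lines, 4 * nreads))


def step(lines, preview=False, preview_nreads=2):
    """Hypothetical step that works on a truncated input when preview=True."""
    if preview:
        lines = preview_reads(lines, preview_nreads)
    else:
        lines = list(lines)
    # Stand-in for real work: report how many reads would be processed.
    return len(lines) // 4


# Five fake FASTQ records.
fastq = []
for i in range(5):
    fastq += ["@read{}".format(i), "ACGT", "+", "IIII"]
```

Calling `step(fastq, preview=True)` processes only the first two records, while `step(fastq)` processes all five.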
Isaac Overcast
@isaacovercast
Jan 04 2016 21:53
I created a new private function for assemble called _get_samples(). In creating the step7 code I saw that all the other steps were doing somewhat similar things to get from the passed-in list of strings to a list of sample objects. I pulled all this code out into one function, so now all the steps parse sample command line args in exactly the same way, and all steps fail on assert if there are no samples to process. Also, as a convenience, if you only want to pass in one sample you can pass it as a string instead of a list with one string element.
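A minimal sketch of a helper along these lines, not the actual _get_samples() code. The `Sample` stand-in class, the `get_samples` name, treating `None` as "all samples", and silently skipping unknown names are all assumptions made for illustration.

```python
class Sample(object):
    """Minimal stand-in for a sample object (hypothetical)."""

    def __init__(self, name):
        self.name = name


def get_samples(all_samples, requested=None):
    """Return Sample objects for the requested names.

    all_samples: dict mapping name -> Sample
    requested: None (all samples), a single name string, or a list of names
    """
    if requested is None:
        names = list(all_samples)
    elif isinstance(requested, str):
        # Convenience: accept a bare string instead of a one-element list.
        names = [requested]
    else:
        names = list(requested)
    samples = [all_samples[name] for name in names if name in all_samples]
    assert samples, "No samples to process"
    return samples


pool = {name: Sample(name) for name in ("1A", "1B")}
one = get_samples(pool, "1A")
both = get_samples(pool, ["1A", "1B"])
```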
Isaac Overcast
@isaacovercast
Jan 04 2016 22:31
Also, i'm going to assume you're gonna set samples to state=6 after cluster_across(), i want to test for this so we don't try to write samples that haven't been clustered..
Isaac Overcast
@isaacovercast
Jan 04 2016 23:54
Any reason we can't just use a plain old dict rather than an ordered dict for the params?
Looking at paramsinfo, it needs to get updated for the changes to param order in assembly. Dict would be easier to maintain w/o the ordering, we never reference params by number in the codebase. Not a big deal, but could make our lives easier