These are chat archives for dereneaton/ipyrad

7th
Jan 2016
Deren Eaton
@dereneaton
Jan 07 2016 00:35
hells yeah, this is actually working really nicely. I've got supercatg building. Now I just need to insert indels back into it.
Regarding random seeds: There are a few other places that random seeds are used. My practice, when I remember to do so, is to have a fixed default random seed, so that unless users change it, the run will be reproducible. However, last I remember hearing, pyrad was not giving exactly reproducible results, so I must have forgotten to set a random seed somewhere. It would be nice to get it right. Hackersdict would probably be a fine place for it. Random is potentially used when the input order is randomized before clustering, and also in loci2SNP when a SNP is randomly sampled.
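A minimal sketch of the fixed-default-seed idea, not the actual pyrad/ipyrad code. The function names, the `DEFAULT_SEED` value, and the use of a local `random.Random` instance are all assumptions made for illustration; the point is only that every random step draws from a generator seeded with a known default.

```python
import random

# Hypothetical default; not a value taken from pyrad or ipyrad.
DEFAULT_SEED = 12345


def randomize_input_order(seqs, seed=DEFAULT_SEED):
    """Return a shuffled copy of seqs; the same seed gives the same order."""
    rng = random.Random(seed)   # local generator, no global state touched
    shuffled = list(seqs)
    rng.shuffle(shuffled)
    return shuffled


def sample_one_snp(snp_positions, seed=DEFAULT_SEED):
    """Pick one SNP position; the same seed gives the same pick."""
    rng = random.Random(seed)
    return rng.choice(snp_positions)


reads = ["read_a", "read_b", "read_c", "read_d"]
first = randomize_input_order(reads)
second = randomize_input_order(reads)
```

Because each function builds its own seeded generator, `first` and `second` are identical unless the caller passes a different seed. A step that falls back on the unseeded global `random` module would be one way a run ends up not exactly reproducible.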
Isaac Overcast
@isaacovercast
Jan 07 2016 03:03
hm, my supercatg isn't building right. It crashes at singlecat in a weird way. I will try rebuilding with a new assembly, since i'm testing with one that already made it through step3
Isaac Overcast
@isaacovercast
Jan 07 2016 03:14
Also, i foresee an interesting hassle: assuming people use this and assuming we continue to modify/improve, people are going to really want a way to "upgrade" their assembly objects without having to rebuild them. If we make changes to assembly.py, it obviously doesn't affect currently extant assembly objects that you load.load()... just a thought.
Deren Eaton
@dereneaton
Jan 07 2016 18:23
Oh, supercat isn't ready for paired data yet. Should be a quick edit tho.
re upgrades: yeah, good point. That's gonna be a tough thing. We should certainly try to minimize changes to Assembly between major updates. Maybe if we tag Assembly objects with the __version__ we can at least catch incompatibilities and recommend some kind of upgrade method.
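A rough sketch of what version-tagging could look like, assuming a stand-in class rather than the real Assembly. The `_version` attribute name, the `is_compatible` helper, and the rule that only the major and minor numbers must match are all hypothetical choices for illustration.

```python
# Hypothetical stand-in for the package's __version__ string.
CURRENT_VERSION = "0.1.0"


class SavedAssembly:
    """Toy stand-in for an Assembly object; not the real ipyrad class."""

    def __init__(self, name, version=CURRENT_VERSION):
        self.name = name
        self._version = version   # stamped when the object is created


def is_compatible(assembly, current=CURRENT_VERSION):
    """Assumed rule: compatible when the major and minor numbers match."""
    saved = getattr(assembly, "_version", None)
    if saved is None:
        return False              # untagged objects predate the stamp
    return saved.split(".")[:2] == current.split(".")[:2]


def upgrade_message(assembly, current=CURRENT_VERSION):
    """Return a warning string for stale objects, or None if all is fine."""
    if is_compatible(assembly, current):
        return None
    saved = getattr(assembly, "_version", "unknown")
    return ("Assembly '{}' was built with version {} but the current "
            "version is {}; it may need to be rebuilt or upgraded."
            .format(assembly.name, saved, current))


old = SavedAssembly("old_run", version="0.0.9")
new = SavedAssembly("new_run")
```

Here `upgrade_message(old)` returns a warning while `upgrade_message(new)` returns `None`, so a loader could run the check right after reading an object from disk.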
Isaac Overcast
@isaacovercast
Jan 07 2016 18:35
That's a pretty good idea actually, makes a ton of sense
Isaac Overcast
@isaacovercast
Jan 07 2016 18:45
I'm going to make the hackers only dict real quick because it feels like we've been talking about it for a while and there are some good candidates for inclusion. Got any name in mind for it? advancedparams? I mean we could call it hackersonly too. either way
Deren Eaton
@dereneaton
Jan 07 2016 19:56
I don't know, I kinda like calling it data.hackersdict. Cuz, you know, it's our program, we can call it whatever we want. And the expectation is that it's only going to be used by people who are doing something unusual with the API, so it's not like we have to talk about it in the publication.
Isaac Overcast
@isaacovercast
Jan 07 2016 19:57
lol! ok, i'll call it that. I'm done, just testing to make sure it doesn't break stuff
Isaac Overcast
@isaacovercast
Jan 07 2016 20:20
Next commit to master is going to require rebuilding any existing Assembly objects, it includes data._hackersonly with 4 params right now (random_seed, max_fragment_length, max_inner_mate_length, and preview_truncate_length). max_fragment_length is used to set maxlen in cluster_across, so i updated that code to use the new hackdict.
Also added a data._version stamp, cuz that's a good idea, maybe we can use minor version number increments to indicate "This update breaks old assemblies".
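For illustration only, a sketch of how an object might carry the four `_hackersonly` params and a `_version` stamp. The class below is a toy, the default values are arbitrary placeholders rather than ipyrad's real defaults, and `get_maxlen` is a hypothetical helper standing in for wherever cluster_across reads its maxlen.

```python
class Assembly:
    """Toy object; the real Assembly class has far more to it."""

    def __init__(self, name):
        self.name = name
        self._version = "0.1.0"
        # Placeholder values only; not the defaults ipyrad ships with.
        self._hackersonly = {
            "random_seed": 42,
            "max_fragment_length": 150,
            "max_inner_mate_length": 60,
            "preview_truncate_length": 500000,
        }


def get_maxlen(data):
    """Hypothetical helper: read maxlen from the dict, not a hard-coded number."""
    return data._hackersonly["max_fragment_length"]


data = Assembly("test")
data._hackersonly["max_fragment_length"] = 300   # override for unusual data
maxlen = get_maxlen(data)
```

Keeping these values in one dict means any step that needs a length limit can look it up in the same place, so an override applies everywhere at once.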
Deren Eaton
@dereneaton
Jan 07 2016 20:29
That all sounds good. max_fragment_length is also used in consens_se.py to build the original .catg files. I'll keep mulling it over, because maybe there's a way to autodetect the longest length from the data without too much overhead. But for now, hackersdict is a good place for it.
Isaac Overcast
@isaacovercast
Jan 07 2016 20:33
Here? catarr = numpy.zeros([optim, 210, 4], dtype='int16')
I agree there must be a way to autodetect, i mean we're running through the data frequently, shouldn't be hard to dip in somewhere, i'll think about it..
Deren Eaton
@dereneaton
Jan 07 2016 20:51
Yeah. The only thing to worry about is weird data like gbs or pairgbs where multiple fragments can partially overlap to build contigs that are longer than the sum of the original forward and reverse reads. We could just trim those when it happens, tho, since it's pretty rare.... yeah, I think this can be done.
More immediately, though, I'm focusing on turning the supercat from step6 into vcf and loci files.
Isaac Overcast
@isaacovercast
Jan 07 2016 21:23
Agreed. I like the idea of just trimming if reads build long contigs, it can't happen often... I'll switch catarr to use max_frag_length so everything is behaving the same.
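A small sketch of the autodetect-then-trim idea, assuming the sequences are available as plain strings. The function names and the `cap` argument (standing in for max_fragment_length) are hypothetical, and this says nothing about where in the pipeline the check would actually go.

```python
def detect_maxlen(seqs, cap):
    """Longest observed length, but never more than the cap."""
    longest = max((len(s) for s in seqs), default=0)
    return min(longest, cap)


def trim_to_maxlen(seqs, maxlen):
    """Cut any sequence longer than maxlen; shorter ones pass through."""
    return [s[:maxlen] for s in seqs]


# Two ordinary fragments plus one unusually long contig.
consens = ["ACGT" * 10, "ACGT" * 12, "ACGT" * 80]
maxlen = detect_maxlen(consens, cap=150)
trimmed = trim_to_maxlen(consens, maxlen)
```

With a cap in place, the rare over-long contig gets trimmed instead of forcing the array's length dimension to grow for every locus.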
Deren Eaton
@dereneaton
Jan 07 2016 21:43
0.1.0 tag, woot!
Speaking of which. I'm still unclear about when we should be adding a git tag. Should we be labelling every bug fix with a tag?
Isaac Overcast
@isaacovercast
Jan 07 2016 21:54
I mean, it could be that every commit to master gets a dot-version increment 0.1.1, 0.1.2, then minor versions indicate major updates like if assembly changes, and major versions are for full release only? Adding a tag for every commit seems kinda annoying, but i'm down for it if you are.
What do you know about orientation of second reads in PE ddrad/gbs?
Isaac Overcast
@isaacovercast
Jan 07 2016 22:01
My belief is that R2 would be oriented reverse on the forward strand, so in the r2 file they'd read 3' to 5', just trying to get my simulated genome right so i can get the flags right in smalt.
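One way to sketch read-pair simulation, assuming the usual Illumina paired-end convention in which R2 is reported 5' to 3' off the opposite strand, so it looks like the reverse complement of the far end of the fragment. Whether that holds for a particular ddrad/gbs prep is an assumption here, and `simulate_pair` is a hypothetical helper, not part of any simulator mentioned above.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def simulate_pair(fragment, readlen):
    """R1 from the start of the fragment; R2 from the far end, opposite strand."""
    r1 = fragment[:readlen]
    r2 = revcomp(fragment)[:readlen]
    return r1, r2


fragment = "AAAACCCCGGGGTTTTACGT"
r1, r2 = simulate_pair(fragment, 8)
```

Under that assumption, reverse-complementing `r2` gives back the last `readlen` bases of the fragment as written on the forward strand.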
Deren Eaton
@dereneaton
Jan 07 2016 22:35
I'll email you some real raw data files.
Deren Eaton
@dereneaton
Jan 07 2016 23:43
OK, so before committing to Master we check the tag and increment it accordingly. I'll try to remember to do that.
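A quick sketch of the increment step under the scheme floated above (patch number per commit to master, minor number when Assembly changes, major number for a full release). The `bump` function and its level names are hypothetical and nothing here touches git itself.

```python
def bump(version, level):
    """Return the next version string for a 'major', 'minor', or 'patch' bump."""
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        return "{}.0.0".format(major + 1)
    if level == "minor":
        return "{}.{}.0".format(major, minor + 1)
    if level == "patch":
        return "{}.{}.{}".format(major, minor, patch + 1)
    raise ValueError("level must be 'major', 'minor', or 'patch'")


after_bugfix = bump("0.1.0", "patch")
after_assembly_change = bump("0.1.2", "minor")
```

Resetting the lower numbers to zero on a minor or major bump keeps the ordering of tags unambiguous.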