These are chat archives for dereneaton/ipyrad

8th
Apr 2016
Isaac Overcast
@isaacovercast
Apr 08 2016 01:00
Here's the trick i'm using to estimate fq.gz file size for step 1: peel off 10000 lines, gzip them to a file, read the size, and extrapolate to the size of the big file. It works really well, but here's the problem: gzip has 9 different compression levels. I'm thinking of setting the test level at 5 and splitting the difference. For real data it ends up estimating 329M reads, where the real data under least compression is 300M and under max compression is ~360M. Thoughts?
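A minimal sketch of that estimation trick, assuming 4-line fastq records and gzip level 5 for the test sample (the `estimate_reads` name and everything besides the 10000-line sample size are illustrative, not ipyrad's actual code):

```python
import gzip
import os
import tempfile

def estimate_reads(fastq_gz_path, sample_lines=10000, level=5):
    """Estimate the number of reads in a large .fq.gz file by
    re-compressing a small sample and extrapolating from file sizes."""
    # Peel off the first `sample_lines` lines of the big file.
    sample = []
    with gzip.open(fastq_gz_path, "rt") as infile:
        for i, line in enumerate(infile):
            if i >= sample_lines:
                break
            sample.append(line)
    # Re-compress the sample at a middling level (5 of 9),
    # splitting the difference between gzip's extremes.
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".gz")
    tmp.close()
    with gzip.open(tmp.name, "wt", compresslevel=level) as out:
        out.writelines(sample)
    sample_size = os.path.getsize(tmp.name)
    os.remove(tmp.name)
    full_size = os.path.getsize(fastq_gz_path)
    # Four lines per fastq record; scale read count by the
    # ratio of compressed file sizes.
    sample_reads = len(sample) / 4
    return int(sample_reads * full_size / sample_size)
```

Because the full file may have been compressed at any of gzip's 9 levels, the estimate carries roughly the +/-10% spread described above.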
Isaac Overcast
@isaacovercast
Apr 08 2016 02:55
Step 1 with approximation of fq.gz file size and better guess of optim is 3 hrs vs 6.5 hrs on real data. woot.
Deren Eaton
@dereneaton
Apr 08 2016 06:01
That's killer!
Deren Eaton
@dereneaton
Apr 08 2016 18:22
it seems to work pretty well, especially if you increase the '-gapopen' penalty in muscle. Needs some testing on real data for sure though.
now that I've really dug in deep to the refmap code I understand it all a lot better.
I'm interested to hear what you think of it. I made a ton of changes to make step3 into more of a DAG-like workflow.
Deren Eaton
@dereneaton
Apr 08 2016 18:28
OK, it's up under a branch called "apply3"
Deren Eaton
@dereneaton
Apr 08 2016 18:39

Here's a primer on the major changes:

  • Apply statements add jobs to the load-balanced task scheduler (lbview) which starts them running on whichever engines are available, and can do so as the engines become available. Once we get everything running this way the initial startup of ipyrad can go a bit quicker.

  • To make sure certain jobs do not run before an earlier dependency has finished, we wrap the job in an "after" statement, passing as an argument the async result from the earlier job, which we stored in a dict. The example below does not run the mcfunc job until the job whose result is stored in res_clust[sample] has finished.

    with lbview.temp_flags(after=res_clust[sample]):
      res_clean[sample] = lbview.apply(mcfunc, [data, sample])
  • Dereplication now happens as the first job in step3, and dereped reads are mapped to the reference. I know we are losing some fancy quality score info we could have used for this, but I think we should just let it go.

  • I removed the mpileup code. Again, it seems we are kind of going all in for the ipyrad base calling, and it made it much easier to look through the refmap code. If we want to use it we can find it in the repo.

  • We index the reference using both smalt and samtools faidx. The former gets the mapped reads which we pull out with bedtools. The latter is used to pull out the reference sequence using samtools for the window of mapped regions that bedtools found. We write to clust.gz the reference seq and mapped reads, which are then clustered together with muscle. Before writing the clustS.gz file the reference is removed.
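The `after=` dependency pattern above is ipyparallel-specific, but the idea (a dependent job only runs once its prerequisite finishes) can be sketched with a plain `concurrent.futures` analogue. All names here (`clustfunc`, `run_after`, the sample ids) are hypothetical stand-ins, not ipyrad code:

```python
from concurrent.futures import ThreadPoolExecutor

def clustfunc(data, sample):
    # stand-in for the clustering job submitted first
    return "clustered-" + sample

def mcfunc(data, sample):
    # stand-in for the cleanup job that must wait for clustering
    return "cleaned-" + sample

def run_after(prereq, func, *args):
    # analogue of `with lbview.temp_flags(after=prereq)`: wait for the
    # prerequisite to finish before running the dependent job. (Real
    # ipyparallel defers *scheduling* rather than blocking a worker.)
    prereq.result()
    return func(*args)

executor = ThreadPoolExecutor(max_workers=4)
data = {"name": "demo"}
res_clust = {}
res_clean = {}

for sample in ["1A", "1B"]:
    # analogous to: res_clust[sample] = lbview.apply(clustfunc, [data, sample])
    res_clust[sample] = executor.submit(clustfunc, data, sample)

for sample in ["1A", "1B"]:
    # dependent job, chained on the stored async result
    res_clean[sample] = executor.submit(
        run_after, res_clust[sample], mcfunc, data, sample)

executor.shutdown(wait=True)
print({s: f.result() for s, f in res_clean.items()})
```

Storing the futures in dicts keyed by sample, as in the `res_clust` / `res_clean` dicts above, is what lets later steps declare dependencies on any earlier job.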

I have not tested reference+denovo or reference-denovo; they are likely both broken at the moment. But I think rebuilding them within the DAG-like workflow will be much clearer. I would like to avoid the process we had earlier where sample.files.edits is overwritten in preview mode and refmapping mode, and reset at the end. It seems a bit too risky. I would prefer we use alternative names for the different files that are mapped vs unmapped, etc., if possible. I tried to do this, but I'm not sure it's completely solid for all cases (e.g., denovo-reference) yet.
Isaac Overcast
@isaacovercast
Apr 08 2016 18:57
Oh wow! Cool, i'll check it out. At first glance i definitely like the flow control changes (DAG) at least in principle.... I was just precious about the mpileup code cuz i worked on it forever, but i agree it's useless baggage at this point... Also i agree that the preview mode overwriting sample.files.edits always made me uncomfortable, i'd be happy if we rewrote that somehow... As for the rest, i'll check it out and let you know once i get to mess with it a bit...