These are chat archives for dereneaton/ipyrad
Here's a primer on the major changes:
Apply statements add jobs to the load-balanced task scheduler (lbview), which starts them on whichever engines are free and keeps dispatching as engines become available. Once everything runs this way the initial startup of ipyrad should go a bit quicker.
To make sure a job does not run before some earlier dependency has finished, we wrap it in an "after" statement and pass in the async result from the earlier job, which we stored in a dict. The example below does not run the mcfunc job until the result stored in res_clust[sample] is finished.
with lbview.temp_flags(after=res_clust[sample]):
    res_clean[sample] = lbview.apply(mcfunc, [data, sample])
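For anyone reading along without an IPython cluster handy, the submit-then-chain idea can be sketched with the stdlib concurrent.futures instead (just an analogy: the real code uses ipyparallel's lbview, and clustfunc/mcfunc/res_clust here are stand-ins; also, ipyparallel tracks the after= dependency on the scheduler side, whereas this sketch blocks on the earlier result before submitting):

```python
from concurrent.futures import ThreadPoolExecutor

def clustfunc(sample):
    # stand-in for the earlier clustering job
    return sample + "-clustered"

def mcfunc(data, sample):
    # stand-in for the dependent cleanup job
    return data + ":" + sample

executor = ThreadPoolExecutor(max_workers=4)
res_clust = {}
res_clean = {}
samples = ["1A_0", "1B_0"]

# submit the first round of jobs; the pool hands them to free workers
for sample in samples:
    res_clust[sample] = executor.submit(clustfunc, sample)

# chain the dependent jobs: wait on each earlier result, then submit
for sample in samples:
    clustered = res_clust[sample].result()  # blocks until the dependency finishes
    res_clean[sample] = executor.submit(mcfunc, clustered, sample)

executor.shutdown()
```

The point is just that each downstream job only sees data from a finished upstream job, which is what the after= flag guarantees without any blocking.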
Dereplication now happens as the first job in step3, and the dereplicated reads are mapped to the reference. I know we lose some quality-score info we could have used for this, but I think we should just let it go.
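In case anyone wants the gist of what dereplication does: it just collapses identical reads and keeps a depth count per unique sequence (a toy sketch only; the real step runs an external derep tool on fastq input):

```python
from collections import Counter

def dereplicate(reads):
    """Collapse identical reads; return (sequence, depth) pairs, most abundant first."""
    return Counter(reads).most_common()

reads = ["ACGT", "ACGT", "TTGA", "ACGT", "TTGA"]
dereped = dereplicate(reads)
# dereped is [("ACGT", 3), ("TTGA", 2)]
```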
I removed the mpileup code. Again, it seems we are going all in on ipyrad's base calling, and removing it made the refmap code much easier to read through. If we want to use it we can find it in the repo.
We index the reference using both smalt and samtools faidx. The smalt index is used to map the reads, and bedtools then pulls out the windows of mapped regions; samtools faidx extracts the reference sequence for each of those windows. We write the reference seq and mapped reads to clust.gz, and they are then aligned together with muscle. The reference is removed before writing the clustS.gz file.
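Roughly, that tool chain looks like the commands below. This is a sketch of the general shape only: the flags, filenames, and the example window are my assumptions, not the exact ipyrad invocations (the commands are built but not executed here):

```python
def refmap_commands(ref="reference.fasta", reads="derep.fasta", sample="1A_0"):
    """Sketch of the reference-mapping tool chain: index the reference two
    ways, map the dereplicated reads, find mapped windows, extract ref seq."""
    idx = ref + ".index"
    bam = sample + ".bam"
    return [
        ["smalt", "index", idx, ref],                         # index for read mapping
        ["samtools", "faidx", ref],                           # index for region extraction
        ["smalt", "map", "-o", sample + ".sam", idx, reads],  # map dereped reads
        ["bedtools", "bamtobed", "-i", bam],                  # windows of mapped regions
        ["samtools", "faidx", ref, "chr1:100-200"],           # ref seq for one (made-up) window
    ]
```

Each window's reference sequence plus its mapped reads would then go into clust.gz for the muscle alignment step.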