These are chat archives for dereneaton/ipyrad

16th
Jan 2016
Isaac Overcast
@isaacovercast
Jan 16 2016 19:17
Pulled your newest changes and testing. In consens_se.removerepeat_Ns(shortcon) it looks like it wants to have 'stacked' passed in too.
Deren Eaton
@dereneaton
Jan 16 2016 20:23
Yeah, sorry I was being lazy and pushed a broken branch before I had time to fix everything. I was leaving town and wanted to make sure I pushed the changes on my workstation before leaving.
figured it's fine since we're still in development mode.
I've been working on some simple HPC testing recently.
Deren Eaton
@dereneaton
Jan 16 2016 20:30
One thing that worries me is that HPC systems recommend that you limit I/O operations, which tend to go pretty slow on their large shared filesystems. But a lot of our parallelization methods rely on splitting files into many small bits, which need to be written, read back in, and then rejoined later. For some operations we could pass the bits around in memory instead of writing files, since machines will most likely have >20GB RAM and can certainly handle holding the entire fastq file for one sample in memory at a time. I just wonder if there is large overhead in passing large memory items between processes. I guess this whole train of thought goes in the 'later refining' category, and is not a top priority. But it could give us speed improvements eventually, especially for users using PBS, etc., to access many dozens of CPUs.
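The split/process/rejoin pattern without any file I/O could look roughly like this (a toy sketch, not ipyrad code; `process_chunk` and `run_in_memory` are made-up names, and a ThreadPool stands in for ipyparallel engines just to keep the demo self-contained):

```python
# Sketch: pass chunks between workers in memory instead of writing
# them to the shared filesystem. ThreadPool is a stand-in here; the
# real version would map chunks to ipyparallel engines.
from multiprocessing.pool import ThreadPool

def process_chunk(chunk):
    """ stand-in for a per-chunk operation, e.g. filtering reads """
    return [line.upper() for line in chunk]

def run_in_memory(lines, nchunks=4):
    """ split the data, farm chunks out, and rejoin with no file I/O """
    step = max(1, len(lines) // nchunks)
    chunks = [lines[i:i + step] for i in range(0, len(lines), step)]
    pool = ThreadPool(nchunks)
    try:
        results = pool.map(process_chunk, chunks)
    finally:
        pool.close()
        pool.join()
    ## flatten back into a single list, preserving order
    return [line for chunk in results for line in chunk]
```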
Deren Eaton
@dereneaton
Jan 16 2016 21:11
Yeah!, Finally got ipcluster to connect to multiple nodes using MPI in a way that doesn't require users to edit tons of config files.
Like you recommended before, it's probably easiest if we just create ipyrad-local and ipyrad-mpi profiles for the users. Maybe this can be done during installation or at first __init__.
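The mpi profile could be created at install time with something like `ipython profile create ipyrad-mpi --parallel` and then pointed at the MPI launcher in its config (a sketch; the exact option name may differ across ipyparallel versions):

```python
# ~/.ipython/profile_ipyrad-mpi/ipcluster_config.py  (sketch)
c = get_config()

## launch engines with MPI so they can span multiple nodes
c.IPClusterEngines.engine_launcher_class = 'MPI'
```

After that, users would just run `ipcluster start --profile=ipyrad-mpi -n 24` without touching any config files themselves.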
Isaac Overcast
@isaacovercast
Jan 16 2016 21:15
Sweet! MPI always makes my brain cringe, I'm glad you're tackling it :thumbsup:
I'm trying to clean up main and get the CLI working better.
Deren Eaton
@dereneaton
Jan 16 2016 21:20
Just ran steps 1&2 using 24 MPI cores on a cluster. Success.
Cool. Yeah I haven't checked at all lately whether the CLI was working.
We'll need to add mpi4py as a requirement to setup.py
It's really nice since it installs openmpi into the local anaconda/bin, so users don't need to load it.
Isaac Overcast
@isaacovercast
Jan 16 2016 21:24
That is nice. The complexity of MPI normally kills the buzz for users; anything we can do to make it easier will be great.
Deren Eaton
@dereneaton
Jan 16 2016 21:25
yeah, kills my buzz too usually. I don't want to debug other people's MPI setups forever, so we should definitely put a warning in the user guide that we only recommend it for advanced users. Newbs should just use the Local cluster.
But it is so useful if we can make it easy. I think a lot of people now use pyrad by asking for 64 cores on an HPC without worrying about their setup, and then they're confused when it runs slow because they actually connected to 8 nodes with 8 cores each, and pyrad can only actually talk to 8 of them.
Isaac Overcast
@isaacovercast
Jan 16 2016 22:24
Some recent change to rawedit is replacing the first base of each sequence in the edits/*fastq files with '0' (zero). I'm trying to debug, but if you have any ideas lemme know
Deren Eaton
@dereneaton
Jan 16 2016 22:25
dang, seems like I broke a bunch of stuff.
I'm still messing around with remote stuff.
pretty easy to set up, and you can open a notebook locally that is running remotely. Here I have my 40 core workstation at Yale running on my laptop at home.
viewing the notebook through the browser on my laptop.
This would be so epic to set up for HPC systems, and it seems a lot of people are working on that. But connecting to a login node and then to a compute node is still pretty complicated. Anyway, still a cool trick.
Isaac Overcast
@isaacovercast
Jan 16 2016 22:30
Yeah, that's super useful. I do that too, but a little differently: I do ssh tunnelling to my remote host and reroute the notebook port to localhost:8888. Works exactly the same way. It's really useful.
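The tunnelling setup is basically two commands (a sketch; hostname, user, and port are placeholders):

```shell
# start the notebook server on the remote machine, headless
ssh user@remote "jupyter notebook --no-browser --port=8888" &

# forward the remote notebook port back to the laptop
ssh -N -L 8888:localhost:8888 user@remote

# then browse to http://localhost:8888 locally
```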
The rk package would be easier to customize, and it allows you to actually control remote notebooks, which I can't do with my hack.
But the login node/compute node issue would be a bitch to handle in either case I think. Unless we got really fancy. Another V2.0 feature...
Isaac Overcast
@isaacovercast
Jan 16 2016 22:35
Hey, what are these:
    lookfor1 = "AGATCGG"
    lookfor2 = "AGAGCGT"
Deren Eaton
@dereneaton
Jan 16 2016 22:47
I only use the first one, I think. "AGATCG" is the beginning of the adapter.
Isaac Overcast
@isaacovercast
Jan 16 2016 22:48
Illumina adapter?
Deren Eaton
@dereneaton
Jan 16 2016 22:48
I guess... it's whatever comes after the revcomp(cut2) on a read that is shorter than the actual fragment
AATTCCTGCAGAAAAAAAAAAAAAAAAAAAAAACTGCAAGATCGGAAAAAA
----- (barcode)
     ------ (cut)
                                 ----- (revcomp cut)
                                      ----- (adapter)
This editor is killing me.
anyway, it's super necessary for trimming bad reads. We do a search for the revcomp(cut2), then for the adapter, then for a little of both, I think, since they often contain errors in them.
Deren Eaton
@dereneaton
Jan 16 2016 22:54
for single-cutter GBS the revcomp is of the single cutter, for ddRAD it is of course of the second cutter.
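That search-and-trim could be sketched like this (a toy stand-alone version, not the actual rawedit code; function names and the search order are illustrative):

```python
def revcomp(seq):
    """ reverse complement of a DNA string """
    comp = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))

def trim_adapter(read, cut2, adapter="AGATCGG"):
    """ trim a read at the first hit of revcomp(cut2)+adapter,
    revcomp(cut2) alone, or the adapter alone; return the read
    unchanged if none of them is found """
    rccut = revcomp(cut2)
    for target in (rccut + adapter, rccut, adapter):
        where = read.find(target)
        if where >= 0:
            return read[:where]
    return read
```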
Isaac Overcast
@isaacovercast
Jan 16 2016 22:55
Think I found the rawedit culprit: lines 337-343.
I'll fix it.
Isaac Overcast
@isaacovercast
Jan 16 2016 23:15
def modify_cuts(data, read1, read2):
    """ fix cut sites to be error free and not carry ambiguities """
    ## shorthand
    cutsmod = data.paramsdict["edit_cutsites"]

    ## replace cut region with a single resolution of cut w/o ambiguities
    if len(read1) and cutsmod[0]:
        if isinstance(cutsmod[0], int):
            read1[1] = read1[1][abs(cutsmod[0]):]
            read1[3] = read1[3][abs(cutsmod[0]):]
        elif isinstance(cutsmod[0], str):
            read1[1][:len(cutsmod[0])] = list(cutsmod[0])
            read1[3][:len(cutsmod[0])] = ["B"]*len(cutsmod[0])
Deren Eaton
@dereneaton
Jan 16 2016 23:16
ah, yeah, I haven't explained what that does yet.
Isaac Overcast
@isaacovercast
Jan 16 2016 23:16
I don't understand the use case where edit_cutsites is a tuple of strings?
lol
Deren Eaton
@dereneaton
Jan 16 2016 23:19
if you want the first five bases to always be your cutsite, e.g., TGCAG, then you use: data1.set_params("edit_cutsites", ("TGCAG", ""))
Second reads should start with their cut site as well, e.g., "AATT".
Isaac Overcast
@isaacovercast
Jan 16 2016 23:20
I see. I'm gonna fix this in the paramschecker code for edit_cutsites. It'll test for int vs string and set the vals of the tuple accordingly. That way this code in rawedit should work without modification.
Deren Eaton
@dereneaton
Jan 16 2016 23:21
In one of my data sets, the first reads often have errors in them, so I wanted them replaced with "TGCAG", whereas the seconds were super loaded with errors and so I thought it better to just remove them. So setting an integer removes that many bases from the beginning of the read.
so a sequence replaces, and an integer trims.
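On a single fastq record the two behaviors could be sketched like this (a hypothetical stand-alone version, not the rawedit code; the dummy "B" quality score mirrors the snippet above):

```python
def apply_edit_cutsite(seqline, qualline, cutsmod):
    """ int -> trim that many bases off the start of the read;
        str -> overwrite the start with the given cut site and give
               those bases a dummy high quality score """
    if isinstance(cutsmod, int):
        ## an integer trims: drop that many bases from seq and quals
        return seqline[abs(cutsmod):], qualline[abs(cutsmod):]
    elif isinstance(cutsmod, str):
        ## a sequence replaces: overwrite the cut region error-free
        seq = cutsmod + seqline[len(cutsmod):]
        qual = "B" * len(cutsmod) + qualline[len(cutsmod):]
        return seq, qual
    return seqline, qualline
```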
Isaac Overcast
@isaacovercast
Jan 16 2016 23:23
Got it.