These are chat archives for dereneaton/ipyrad

11th
Feb 2016
Isaac Overcast
@isaacovercast
Feb 11 2016 00:28
Ungh, i don't know if you remember the problem i was having with gbs data where it was crashing inside muscle_align at line 189:
                    for rseq in revs:
                        try:
                            idxs.append(max(\
                                [i for i, j in enumerate(rseq) if j != "-"]))
I fuckin finally tracked that fucker down. Found the offending sample, then dug through all the tmp .ali files, which took for fuckin ever. I printed out rseq:
rseq ['TGCAGAGAGA--GAAGCAACTGGAGCAAGAAACCCCTGCA------------------------------------------------------', '----------------------------------------------------------------------------------------------', 'TGCANAGAGA--GAAGCAACTGGAGCAAGAAACCCCNGCA------------------------------------------------------', 'TGCAGAGAGA--GAAGCAACTGGAGCAAGAANCCCCTGCAGCAACCCCCGGAGCTGCCATCCCAGCACACTGCA--------------------']
Take a gander.
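The second entry in that rseq list is all gaps, so the list comprehension comes back empty and max() has nothing to work with, which is presumably the crash. A minimal sketch of a guard, where `last_base_index` is a hypothetical helper and not the actual muscle_align code:

```python
def last_base_index(rseq):
    """Return the index of the last non-gap character, or None if all gaps."""
    idxs = [i for i, base in enumerate(rseq) if base != "-"]
    # max() of an empty list raises ValueError, so check before calling it
    return max(idxs) if idxs else None


# Toy sequences standing in for the real alignment rows
revs = ["TGCA--", "------", "TGCAGC"]
print([last_base_index(r) for r in revs])
```

A caller could treat a None result as a signal to drop the whole cluster.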
Deren Eaton
@dereneaton
Feb 11 2016 00:49
So this looks like an example where the cut site occurred multiple times in the fragment, and so a bunch of partially overlapping fragments are mapping together. The cut is at "CTGCAG", and you see "TGCAG" at the beginning and "CTGCA" at the end. We can avoid these coming up by limiting the query_cov param in the hackersonly dict.
as for what we should do...
the second seq must be one that aligned to the far left of the seed, and thus the whole thing got trimmed off when we trim the reads by the left end of the seed.
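A quick way to check the multiple-cut-site idea is to scan the degapped read for internal copies of the recognition sequence. This is only a sketch; `find_cut_sites` is a hypothetical helper, and the read is the longest sequence from the rseq printout with its trailing gaps removed:

```python
def find_cut_sites(seq, site="CTGCAG"):
    """Return start positions of every occurrence of site in the degapped seq."""
    seq = seq.replace("-", "")  # drop alignment gaps first
    hits = []
    start = seq.find(site)
    while start != -1:
        hits.append(start)
        # step forward by one so overlapping matches are also found
        start = seq.find(site, start + 1)
    return hits


read = "TGCAGAGAGA--GAAGCAACTGGAGCAAGAANCCCCTGCAGCAACCCCCGGAGCTGCCATCCCAGCACACTGCA"
print(find_cut_sites(read))
```

Any hit away from the read ends means a full cut site sits inside the fragment.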
Isaac Overcast
@isaacovercast
Feb 11 2016 00:51
If you want to look at the chunk and the actual locus that caused this #108
Isaac Overcast
@isaacovercast
Feb 11 2016 01:01
Assuming you're right about multiple cut sites overlapping i think it's probably safest to just toss out the whole chunk. This happened ONCE out of all the samples (~300 Mil reads), don't think we would lose much sleep over throwing it out.
Deren Eaton
@dereneaton
Feb 11 2016 01:02
this is single-end gbs, right?
Isaac Overcast
@isaacovercast
Feb 11 2016 01:03
Yes
Deren Eaton
@dereneaton
Feb 11 2016 01:03
I agree, we could just throw it out. We can count it as being filtered by too many indels.
Isaac Overcast
@isaacovercast
Feb 11 2016 01:03
yeah WAYY too many :tongue:
Deren Eaton
@dereneaton
Feb 11 2016 01:03
I've been meaning to get back to fixing the indel filter in step3, and outputting the number of indel filtered loci.
This kind of stuff can come up a lot if you size-select too small in a library
or don't size select at all, as in the case of my test pairGBS library
I've been hacking away at the CLI params.txt parsing today.
there was a major problem in not writing over parameters when they were set to empty values.
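One common way this kind of bug shows up is a truthiness check that treats an empty string as "not provided". The sketch below is hypothetical: `update_params`, the dict layout, and the example keys are assumptions for illustration, not ipyrad's actual parser.

```python
def update_params(current, parsed):
    """Overwrite current values with parsed ones, including empty strings."""
    updated = dict(current)
    for key, value in parsed.items():
        # `if value:` would skip "" and leave the stale setting in place;
        # checking `is not None` lets an explicitly empty value overwrite it.
        if value is not None:
            updated[key] = value
    return updated


# Illustrative parameter names and values only
current = {"barcodes_path": "/old/barcodes.txt", "datatype": "rad"}
parsed = {"barcodes_path": "", "datatype": "gbs"}
print(update_params(current, parsed))
```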
Isaac Overcast
@isaacovercast
Feb 11 2016 01:06
Hmm, i can see that
Deren Eaton
@dereneaton
Feb 11 2016 01:22
I'm thinking that the -r flag should print only the summary stats (data.stats), and then we should print just the location of the individual step stats files and users can go look at them if they want. I'll make a ticket to make it more clear.
Deren Eaton
@dereneaton
Feb 11 2016 01:45
well, I'll think about it. It looks better now that you've widened the print display.
Isaac Overcast
@isaacovercast
Feb 11 2016 02:01
Sounds like an opportunity for a -v flag
Isaac Overcast
@isaacovercast
Feb 11 2016 02:55

and so the Assembly object shouldn't be saved until the user has at least finished one of the assembly steps.

I verified this is true.

Isaac Overcast
@isaacovercast
Feb 11 2016 03:10
_launch2 is throwing an error "Too many open files", this is on a brand new assembly on step1 on sim data, should be plain vanilla. See this before?
Isaac Overcast
@isaacovercast
Feb 11 2016 03:25
Wow, i totally eff-ed myself messing with launchctl limit maxfiles: set it to a low number and shit breaks! go figure. I can't even reset my box remotely bcz i can't sudo because the OS can't open the sudoers file. ugg fml
Deren Eaton
@dereneaton
Feb 11 2016 03:53
aha, it's not too many files but too many open files. I saw this problem a long while back. It happens when the ipyclient processes don't close. Must be some difference in the new launch2 command. I'll look into it.
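When "Too many open files" appears, it can help to see what limit the Python process is actually running under. A small sketch using the standard-library resource module (Unix only):

```python
import resource

# Soft and hard limits on open file descriptors for the current process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file limit: soft=%s hard=%s" % (soft, hard))
```

A very low soft limit points at the system settings rather than at leaked handles in the code.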
Isaac Overcast
@isaacovercast
Feb 11 2016 05:26
Ooof, once we release this into the wild we're going to have to be really cognizant of any changes to assembly objects. Pulled the new assembly with the change from _ipclusterid string to _ipcluster dict, and i had to hack the source to make it so i didn't have to start over at step2 on the full run with glenn's data. Probably a good idea to keep thinking about making load_assembly() smarter in handling updates.
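One way a loader could tolerate this sort of change is to upgrade old attributes as they are read in. The sketch below is purely illustrative: `upgrade_assembly_attrs` and the "id" key inside the new dict are assumptions, not how load_assembly() or the real `_ipcluster` dict are laid out.

```python
def upgrade_assembly_attrs(attrs):
    """Return a copy of an old attribute dict updated to the newer layout."""
    upgraded = dict(attrs)
    if "_ipcluster" not in upgraded:
        # Hypothetical mapping: keep the old id string under an "id" key
        upgraded["_ipcluster"] = {"id": upgraded.pop("_ipclusterid", "")}
    return upgraded


old = {"name": "test", "_ipclusterid": "abc123"}
print(upgrade_assembly_attrs(old))
```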
Deren Eaton
@dereneaton
Feb 11 2016 05:40
Whoops. Good point. I'll be more careful about modifying the object.
Deren Eaton
@dereneaton
Feb 11 2016 15:59
Line 913 in assembly.py should ensure that ipyclient processes always close:
        finally:
            try:
                ## pickle the data obj
                self.save()                
                ## can't close client if it was never open
                if ipyclient:
                    ipyclient.close()
            except UnboundLocalError:
                pass
It's in the _clientwrapper func
so I'm not sure what's going on...
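An alternative to catching UnboundLocalError is to bind the name before the try block, so the finally clause can always test it. A sketch with a stand-in client; `FakeClient` and `run_step` are hypothetical and not part of ipyrad or ipyparallel:

```python
class FakeClient:
    """Stand-in for a parallel client; only tracks whether close() ran."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def run_step(make_client, work):
    ipyclient = None  # bound up front so `finally` can always check it
    try:
        ipyclient = make_client()
        return work(ipyclient)
    finally:
        # can't close a client that was never opened
        if ipyclient is not None:
            ipyclient.close()


client = FakeClient()
try:
    run_step(lambda: client, lambda c: 1 / 0)  # the work fails on purpose
except ZeroDivisionError:
    pass
print(client.closed)
```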
Deren Eaton
@dereneaton
Feb 11 2016 16:38
have you been using preview mode lately... I'm using it on step2 right now and it seems to be working through the whole file.
Deren Eaton
@dereneaton
Feb 11 2016 16:47
which step were you running when the too-many-open-files problem arose? Maybe it's not a problem with ipyclient and we just leave a ton of files open somewhere... For example, we create a ton of tmp *.ali files in the name-tmpalign/ dir. We should probably do some kind of more efficient chunking to create fewer files.
Isaac Overcast
@isaacovercast
Feb 11 2016 16:49
I'm not convinced it wasn't just my computer, i was running other jobs for another project, I think i just made the tix cuz it spooked me, but i'm not going to stress too hard unless I can reliably reproduce it.
re: preview mode the default value for truncate length is optimized for step1. so in step2 yeah it's probably using most of the file, we should probably switch step2 (and step3) to use ["preview_truncate_length"]/10 or something like that.
Deren Eaton
@dereneaton
Feb 11 2016 17:27
I think I found the bug that was creating all of the open file handles
Deren Eaton
@dereneaton
Feb 11 2016 17:50
fixed.
fixed a bunch of stuff that could be causing major CLI problems with loading assemblies, you might want to pull in right away.
Isaac Overcast
@isaacovercast
Feb 11 2016 18:18
Got it, thx.
Deren Eaton
@dereneaton
Feb 11 2016 18:38
welp, pushing again. Some major confusion with full versus short paths to saved assemblies.
think I got it under control
Deren Eaton
@dereneaton
Feb 11 2016 21:22
I think I finally got the pair merging down right. Was totally effed up.
now the pairGBS is reverse complement clustering beautifully.
Isaac Overcast
@isaacovercast
Feb 11 2016 22:12
sweet. I discovered a stupid bug on my cluster. nfs creates hidden .nfs-000watever files inside directories, when ipyrad tries to clean up the tmp directories it throws an os error because these files are "busy". I want to wrap calls to rmtree in a try/except to handle this, otherwise the run dies. figure it's better to leave an (empty) tmp directory hanging, than to let the run die (which otherwise completed nicely).
Deren Eaton
@dereneaton
Feb 11 2016 22:14
what is nfs?
nm, google told me.
why does it make those hidden files?
Isaac Overcast
@isaacovercast
Feb 11 2016 22:15
Idk, i'll try to figure it out
Isaac Overcast
@isaacovercast
Feb 11 2016 22:34
http://nfs.sourceforge.net/#faq_d2. My guess is these files were created when step 3 was crashing hard for me, muscle_align opens a temp file, errors out and then the finally block cleans up the open file, but the ipyclient still has an open reference. This is an annoying bug. In practice it should never come up (assuming all steps are working).
Deren Eaton
@dereneaton
Feb 11 2016 22:35
yeah, it's probably from the bug that was leaving too many ipyclients open.
Isaac Overcast
@isaacovercast
Feb 11 2016 22:36
The quick fix is to wrap rmtree calls in a try/except and leave the tmp directories hanging. The "real" fix is to go in and make sure all open()'d files are closed inside finally blocks, or something like that, which seems really annoying
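The quick fix could look roughly like this. `remove_tmpdir` is a hypothetical helper name; shutil.rmtree and OSError are standard library:

```python
import os
import shutil
import tempfile


def remove_tmpdir(path):
    """Try to delete a tmp dir; return False instead of raising on failure."""
    try:
        shutil.rmtree(path)
        return True
    except OSError as err:
        # e.g. NFS "busy" placeholder files; leave the directory behind
        print("could not remove %s: %s" % (path, err))
        return False


tmpdir = tempfile.mkdtemp()
print(remove_tmpdir(tmpdir), os.path.exists(tmpdir))
```

Returning a flag rather than silently passing leaves room to log which tmp directories were left hanging.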
Deren Eaton
@dereneaton
Feb 11 2016 22:38
hmm, so we can't rmtree the tmp-align dir b/c some of the .ali files are open?
Isaac Overcast
@isaacovercast
Feb 11 2016 22:39
Only in really weird edge cases
Deren Eaton
@dereneaton
Feb 11 2016 22:50
in the finally block, before rmtree, can you have a try block to close all .ali files?
Isaac Overcast
@isaacovercast
Feb 11 2016 23:20
Yeah, probably something like that. I'll clean it up.
Isaac Overcast
@isaacovercast
Feb 11 2016 23:27
This message was deleted
-    if args.force or '1' in args.steps:
+    ## if forcing and doing step 1 then do not load existing Assembly
+    if args.force and '1' in args.steps:
Isaac Overcast
@isaacovercast
Feb 11 2016 23:33
main isn't creating new assemblies unless you use the force flag. Part of me just wants to say "If 1 is in args.steps then create a new assembly", we guard against creating assemblies with the same name, so if somebody messes up and passes step 1 to an assembly that already exists it'll error out, won't create a new assembly over top of the existing one. Force isn't doing anything in this context, i don't think.
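To make the difference concrete, here are the three candidate conditions side by side. This is a sketch only; `creates_new_assembly` is a hypothetical function, and only the boolean expressions come from the diff and the suggestion in this message:

```python
def creates_new_assembly(steps, force, rule):
    """Evaluate one of the three candidate rules for creating an Assembly."""
    if rule == "or":        # the removed line of the diff
        return force or "1" in steps
    if rule == "and":       # the added line of the diff
        return force and "1" in steps
    return "1" in steps     # the simpler rule: step 1 always creates


for steps, force in [("1", False), ("1", True), ("23", True)]:
    results = [creates_new_assembly(steps, force, r) for r in ("or", "and", "step1")]
    print(steps, force, results)
```

The first row is the case described here: step 1 without the force flag creates nothing under the "and" rule.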
Isaac Overcast
@isaacovercast
Feb 11 2016 23:56
preview_truncate_fq is used by preview mode for steps 1, 2, and 3:
dereneaton/ipyrad@5597606
This change breaks it for steps 1 and 3. Was thinking an option would be to change the preview_truncate_length to be a percentage, rather than a stupid number, that way it'll slice off a chunk no matter how big the input files are.
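A percentage-based truncation might be sketched as follows. `truncate_fastq_lines` is a hypothetical helper that works on a list of lines already read into memory; it is not how preview mode is implemented:

```python
def truncate_fastq_lines(lines, fraction):
    """Keep roughly the first `fraction` of reads from a list of FASTQ lines."""
    nreads = len(lines) // 4  # a FASTQ record is 4 lines
    # keep at least one read whenever the input has any
    keep = max(1, int(nreads * fraction)) if nreads else 0
    return lines[: keep * 4]  # always cut on a record boundary


# Ten toy reads
reads = []
for i in range(10):
    reads += ["@read%d" % i, "ACGT", "+", "IIII"]
print(len(truncate_fastq_lines(reads, 0.2)) // 4)
```

Cutting on multiples of four lines keeps the truncated file a valid FASTQ regardless of input size.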
Deren Eaton
@dereneaton
Feb 11 2016 23:59
Depends which version you are using, I made a bunch of fixes to that today
to the '1' and/or args.force, that is.
oh you are referencing that...