These are chat archives for dereneaton/ipyrad

3rd
Mar 2016
Deren Eaton
@dereneaton
Mar 03 2016 14:46
Have you seen this error in step 6?:
    403             ## to the locus, and instead the locus is masked for exclusion
    404             ## using the filters array.
--> 405             local[iloc-start] = catarr[int(ask[4]), :icatg.shape[1], :]
    406             #icatg[iloc] = catarr[int(ask[4]), :icatg.shape[1], :]
    407         elif ask.shape[0] > 1:

IndexError: index 10000 is out of bounds for axis 0 with size 10000
Isaac Overcast
@isaacovercast
Mar 03 2016 15:30
I haven't seen that error. That's the same line where I was seeing the ValueError I mentioned earlier, maybe coincidentally.
Isaac Overcast
@isaacovercast
Mar 03 2016 15:36
If step 3 doesn't finish clustering all samples then none of the samples get set to 2.5. I'm seeing this behavior on the LSU cluster, where they have wall time limits that are shorter than the amount of time needed to cluster all samples. I'm trying to figure out a way to update sample.stats.state at runtime so rerunning an interrupted step 3 will actually recognize samples with clust.gz files. It's tricky.
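One way to recognize already-clustered samples on a rerun might be to look for their output files on disk. The sketch below is only an illustration of that idea, not ipyrad code: the function name, the `<sample>.clust.gz` naming pattern, and the directory layout are all assumptions.

```python
import os
import tempfile

def clustered_sample_names(clust_dir, sample_names, suffix=".clust.gz"):
    """Return the sample names whose cluster file exists and is non-empty.

    Hypothetical helper: the file naming (<name> + suffix) is an assumption.
    """
    done = []
    for name in sample_names:
        path = os.path.join(clust_dir, name + suffix)
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            done.append(name)
    return done

# Small demonstration with a temporary directory and a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "sampleA.clust.gz"), "wb") as fh:
        fh.write(b"placeholder")
    recovered = clustered_sample_names(tmp, ["sampleA", "sampleB"])
```

A check like this cannot tell a finished file from one left half-written by a job killed at the wall time limit, so it would probably need to be paired with writing to a temporary name and renaming on completion.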
Deren Eaton
@dereneaton
Mar 03 2016 15:50
clustering is usually pretty fast, it's the damn aligning that takes so long. Seems like we should do a data.save() when all of the clustering is finished, and before aligning starts.
Deren Eaton
@dereneaton
Mar 03 2016 16:08
I think I found a fix for the problem I pasted above.
Isaac Overcast
@isaacovercast
Mar 03 2016 16:10
sick. what was it?
What do you think about iterating over the asyncResults from map so we can update sample state as they actually complete?
            results = threaded_view.map(clustall, submitted_args)
            for i, success in enumerate(results):
                if success:
                    LOGGER.debug("Finished clustering {}".format(samples[i]))
                    samples[i].stats.state = 2.5
Deren Eaton
@dereneaton
Mar 03 2016 16:12
I was using 10000 as the size of a small array that gets filled up before writing to the h5 array to decrease the amount of I/O. And when the local array reaches 10K it should be wiped. But I was wiping it after it tried sampling the 10000th thing. So I just had to move the code for the wiping up a tiny bit.
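The buffering pattern described here can be sketched roughly as follows. This is not the actual step 6 code: the function and variable names are made up for illustration, a plain NumPy array stands in for the h5 array, and the buffer is shrunk from 10000 to 4 to keep the example small. The point is only the ordering: flush and reset the buffer before writing into it, so the buffer index never reaches the buffer size.

```python
import numpy as np

def fill_in_chunks(values, out, chunk=4):
    """Copy `values` into `out` through a small buffer of size `chunk`.

    Hypothetical sketch: `out` stands in for the on-disk h5 array.
    """
    local = np.zeros(chunk, dtype=out.dtype)
    start = 0  # index in `out` where the current buffer begins
    for iloc, val in enumerate(values):
        # Flush and wipe BEFORE writing, so iloc - start is always < chunk.
        # Doing this after the write below is what raises the IndexError.
        if iloc - start == chunk:
            out[start:iloc] = local
            local[:] = 0
            start = iloc
        local[iloc - start] = val
    # Write whatever is still sitting in the buffer.
    nleft = len(values) - start
    out[start:start + nleft] = local[:nleft]
    return out

result = fill_in_chunks(list(range(10)), np.zeros(10, dtype=np.int64), chunk=4)
```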
Isaac Overcast
@isaacovercast
Mar 03 2016 16:13
i see.
Deren Eaton
@dereneaton
Mar 03 2016 16:13
That would be great. I think the way to do it is to use apply instead of map. This allows you to add one job at a time, so we would query for finished results, clean them up, and then add the next job.
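The idea of updating each sample's state as its job finishes, rather than after the whole map returns, can be sketched like this. To keep the example runnable without a cluster it uses the standard library's `concurrent.futures` instead of the ipyparallel view from the chat, and it submits every job up front rather than adding one at a time. The `clustall` stand-in, the sample names, and the starting state of 2 are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def clustall(name):
    """Hypothetical stand-in for the real clustering job; True means success."""
    return not name.startswith("bad")

def cluster_and_track(names, max_workers=2):
    """Submit one job per sample and record its state as soon as it finishes."""
    state = {name: 2 for name in names}  # assumed pre-clustering state
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One future per sample, so each result can be handled on its own.
        futures = {pool.submit(clustall, name): name for name in names}
        for future in as_completed(futures):
            name = futures[future]
            if future.result():
                # Only samples that actually finished get advanced to 2.5.
                state[name] = 2.5
    return state

states = cluster_and_track(["s1", "bad_s2", "s3"])
```

With per-job results, an interrupted run would still have recorded every sample that completed before the interruption, provided the state is saved as it is updated.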
Deren Eaton
@dereneaton
Mar 03 2016 16:23
You can find an example in the svd4tet code
For right now, since our 'soft launch' is happening this weekend at the workshop in Idaho, we should probably try to avoid pushing any major changes without really thorough testing, since we're punitively going to have real users very soon.
putatively*
Isaac Overcast
@isaacovercast
Mar 03 2016 16:31
Lol, no "punitively" is probably right too :p
Deren Eaton
@dereneaton
Mar 03 2016 22:06
Did Glenn make his GBS data set at LSU?
we're doing some GBS preps here and I'd be interested to compare protocols.
and to know what kind of size selection they were aiming for. Which cutter did he use?
Isaac Overcast
@isaacovercast
Mar 03 2016 23:18
TGCAG
I'll ask about size selection, etc.