These are chat archives for dereneaton/ipyrad

6th
Sep 2016
Shea Lambert
@SheaML
Sep 06 2016 18:07
Hi Deren and Isaac, just wondering whether you are planning to include hierarchical clustering, as in the original pyrad, in ipyrad. Thanks!
Emily Warschefsky
@ewarschefsky_twitter
Sep 06 2016 19:23
Hey Deren - is there any way that you can bring back the option to use a wildcard in the barcodes file path? The option existed in the last version of pyrad, but for some reason it doesn't work anymore and I get the following error:
Error: barcodes file not found. This must be an absolute path
(/home/wat/ipyrad/data/data_barcodes.txt) or relative to the directory
where you're running ipyrad (./data/data_barcodes.txt). You entered:
/scratch/ewars001/ipyrad/Lane1/Sample_9/*barcodes.txt
Emily Warschefsky
@ewarschefsky_twitter
Sep 06 2016 19:52
Also, I am getting the following printing to my screen, which doesn't make sense:
Note: barcodes XH:GACGCGTGA and XH:GACGCGTGA are within 2 base change of each other
Ambiguous barcodes that match to both samples will arbitrarily
be assigned to the first sample. If you do not like this idea
then lower the value of max_barcode_mismatch and rerun step 1
Deren Eaton
@dereneaton
Sep 06 2016 19:55
Hi Emily, 1. Yes we can change it to allow fuzzy match to barcode file names again. 2. Do you maybe have a barcode repeated in your barcodes file?
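For readers who need a wildcard in the meantime, one workaround is to expand the pattern yourself and pass the resolved path to the params file. The following is a minimal sketch using Python's standard `glob` module; the helper name `resolve_barcodes_path` and the demo file name are hypothetical, and this is not how ipyrad itself resolves the path.

```python
import glob
import os
import tempfile


def resolve_barcodes_path(pattern):
    """Expand a wildcard pattern and require exactly one matching file.

    Hypothetical helper for illustration; not part of ipyrad.
    """
    matches = sorted(glob.glob(pattern))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one barcodes file for {pattern!r}, "
            f"found {len(matches)}"
        )
    return matches[0]


# Demo in a throwaway directory with a made-up file name.
with tempfile.TemporaryDirectory() as tmpdir:
    open(os.path.join(tmpdir, "lane1_barcodes.txt"), "w").close()
    resolved = resolve_barcodes_path(os.path.join(tmpdir, "*barcodes.txt"))
    print(os.path.basename(resolved))
```

Requiring exactly one match is a design choice made here so that a pattern matching several barcodes files fails loudly instead of silently picking one.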
Emily Warschefsky
@ewarschefsky_twitter
Sep 06 2016 19:56
Nope, I checked the barcodes files. It is doing it a bunch of times for each barcode.
Deren Eaton
@dereneaton
Sep 06 2016 20:00
When I run it with two barcodes that are similar it gives a message with the two different barcodes, e.g.,
        Note: barcodes 3M_0:GTGTGA and 3L_0:GTGTGT are within 2 base change of each other
        Ambiguous barcodes that match to both samples will arbitrarily
        be assigned to the first sample. If you do not like this idea
        then lower the value of max_barcode_mismatch and rerun step 1
Do you see two different barcodes in your case, or the same ones printed twice like in your example?
Oh, I see that it does print the same one twice among the many other warnings; we can fix that. But is it warning you that other barcodes are within two bases of each other too?
Deren Eaton
@dereneaton
Sep 06 2016 20:11
Counting the differences between barcodes is actually a little unintuitive. For example, even though GGGATT and TGTAGT have 3 base differences between them, the sequence GGTACT, which may be observed in your data, is 2 bases away from both of them and so cannot be definitively assigned. Usually, to allow 2 base differences among a large set of barcodes, the barcodes have to be longer (e.g., 10 bases) so that there are more possible combinations. Thanks for pointing this out though, we'll try to improve the warning message so it is clearer.
@ewarschefsky_twitter
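The counting in the message above can be checked with a few lines of Python. This is only an illustrative sketch: the `hamming` function is a generic mismatch count written for this example, not ipyrad's internal code, and the three sequences are the ones quoted in the message.

```python
def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


barcode_1 = "GGGATT"
barcode_2 = "TGTAGT"
observed = "GGTACT"  # a sequence that sequencing errors could produce

print(hamming(barcode_1, barcode_2))  # 3
print(hamming(observed, barcode_1))   # 2
print(hamming(observed, barcode_2))   # 2
```

So two barcodes that are 3 bases apart can still share a neighbor that is 2 bases from each, which is why allowing 2 mismatches safely requires designed barcodes to be at least 5 bases apart.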
Emily Warschefsky
@ewarschefsky_twitter
Sep 06 2016 20:46
I just see the warning with the same barcode listed twice, but it gives me the warning multiple times for each barcode. I guess this is normal and I should ignore it for now? We designed our barcodes to be more than 2 bp apart, so it should be fine.
Deren Eaton
@dereneaton
Sep 06 2016 20:54
@ewarschefsky_twitter Our general recommendation is to set the max_barcode_mismatch parameter to 0 or 1, unless you are losing a lot of data from mismatches because of low-quality reads in the barcode portion, i.e., the barcodes contain frequent Ns. It is better to exclude a little data than it is to include a little incorrect data. And like I said above, some of your data could be ambiguously assigned if there are possible barcode sequences (not the ones you designed, but ones which arise due to errors) that are within two base differences of more than one of your true barcodes. You can always try it both ways and check the stats output file (in {workdir}/{name}_fastqs/s1.demultiplex_stats.txt) to see how often barcodes are being assigned that are more than 1 base difference from your barcodes. I hope that is clearer. Cheers,
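The trade-off described here can be illustrated with a toy assignment routine. This is a sketch under stated assumptions, not ipyrad's demultiplexing code: the sample names, the `assign` helper, and the three example reads are hypothetical, and the "first matching sample wins" rule is taken only from the wording of the warning message.

```python
def hamming(a, b):
    """Count mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def assign(read_barcode, barcodes, max_mismatch):
    """Return (first matching sample or None, number of matching samples)."""
    hits = [name for name, seq in barcodes.items()
            if hamming(read_barcode, seq) <= max_mismatch]
    return (hits[0] if hits else None), len(hits)


# Hypothetical sample names; barcode sequences are from the example above.
barcodes = {"sample_A": "GGGATT", "sample_B": "TGTAGT"}
reads = ["GGGATT", "GGGATA", "GGTACT"]  # exact, 1 error, ambiguous

results = {}
for max_mismatch in (0, 1, 2):
    for read in reads:
        sample, n_hits = assign(read, barcodes, max_mismatch)
        results[(max_mismatch, read)] = (sample, n_hits)
        flag = " (ambiguous)" if n_hits > 1 else ""
        print(max_mismatch, read, sample, flag)
```

In this toy case, raising the limit from 0 to 1 recovers the read with a single error, while raising it to 2 also assigns the ambiguous read to whichever sample happens to be listed first.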
Emily Warschefsky
@ewarschefsky_twitter
Sep 06 2016 20:55
@dereneaton - great, thanks for the quick response!
Edgardo M. Ortiz
@edgardomortiz
Sep 06 2016 22:55
@ewarschefsky_twitter I had the problem of losing too much data when demultiplexing with 0 mismatches and, as the warning says, pyrad will assign ambiguous barcodes to the first sample that matches when you allow for more mismatches. My solution was to use deML (https://github.com/grenaud/deML) for demultiplexing with up to 2 mismatches. The fastqs need some preparation, but if you want to try it I can send you my bash scripts.