These are chat archives for dereneaton/ipyrad

17th
Jul 2017
Nikki Phair
@phair_nikki_twitter
Jul 17 2017 12:48
Hi all, I have pooled ezRAD data (multiple individuals sequenced as one sample) and I'm considering using ipyrad to analyses this. Does anyone know if ipyrad can handle pooled data? Someone suggested ipyrad to me in a workshop... At the moment we are using PoPoolation but it seems to be giving us lots of trouble. Thanks in advance
Zahra
@nematiz
Jul 17 2017 14:20

Hi everyone,

I am running ipyrad but I got the following error and it has been stuck on Step 6.

Step 6: Clustering at 0.9 similarity across 165 samples
[ ] 0% building clusters | 0:00:00 sort: cannot read: /data/GBSall.utemp: No such file or directory
[####################] 100% building clusters | 0:04:04
[####################] 100% aligning clusters | 0:19:21
Encountered an error (see details in ./ipyrad_log.txt)
Error summary is below -------------------------------
error in step 6 IndexError(boolean index did not match indexed array along dimension 0; dimension is 108 but corresponding boolean dimension is 107)

Has anyone any idea?
Thanks in advance

Deren Eaton
@dereneaton
Jul 17 2017 14:37
Hi @nematiz, which version are you using?
Deren Eaton
@dereneaton
Jul 17 2017 14:48
Hi @phair_nikki_twitter, (1) I think ipyrad has the best methods for analyzing ezRAD data for sure. That type of data often has a lot of partial overlap of reads, or merging, due to the use of common cutters, and also reverse complement matching of paired-reads, all of which ipyrad deals with much better than other software does. (2) As for pooled data, ipyrad can 'handle' pooled data somewhat, it depends what you hope to get as a result. ipyrad's basecaller assumes diploidy, and so it assumes that in a heterozygote the alternative allele will be present in around 50% of reads. If it comes up in only 5% of reads it will likely be called as an error. So ipyrad is likely to miss heterozygote calls for rare variants in pooled populations.
Zahra
@nematiz
Jul 17 2017 14:53
Hi @dereneaton I am using 0.6.27.
Deren Eaton
@dereneaton
Jul 17 2017 14:57
Hi @nitishnarula , The DP value is based on the number of bases that are available to make the basecall at that site. The lower number of bases at site 3 is probably due to indel variation at that site in the aligned cluster. The fact that you are getting a high occurrence of a third base in a lot of SNPs in this cluster is worrisome. Is it like that in a lot of the loci? Usually you expect most SNPs to be biallelic. It could simply be that this locus is poorly aligned in the few first bases, or it could be that the barcodes/adapters are not being properly trimmed from the left end.
Nitish Narula
@nitishnarula
Jul 17 2017 16:09

Thanks for the explanation, @dereneaton ! I'll look into how many SNPs are biallelic and how many have a third base. Here's that example locus from the .loci file:

sample1     AGRAAAG-GGAGGGGGATCGGAAGAGCACACGTCTGAAC
sample2     RGRGAAGAGGAGGGGGATCGGAAGAGCACACGTCTGAAC
sample3     KGAAMAG-TGAGGGGGATCGGAAGAGCACACGTCTGAAC
sample4     ARMMMNG--GAGGGGGATCGGAAGAGCACACGTCTGAAC
//          *-*-*   -                              |192|

Again, this is just one example I pulled out. In our downstream analysis this locus would not be included since I would filter it out based on the number of samples (NS) in this locus (only 4 out of 145). Another filter could be based on proportion of biallelic SNPs within a locus, as you mention.

Nitish Narula
@nitishnarula
Jul 17 2017 16:17
SNP-position.png Another thing I have seen is that the position of these SNPS in the dataset is kind of skewed to the end of the loci. Of course this example shows the opposite. But we would expect a uniform distribution, right?
Deren Eaton
@dereneaton
Jul 17 2017 16:49
@nitishnarula In general, yes we should expect a uniform distribution of SNPs along the length of a locus. However, the ability to impute indels during alignment as a way to explain differences between sequences is decreased at the very edges, since the following sequence is absent, and so there is often a slight increase at the ends (at least for denovo assembled data). This is why we have an option to trim N base pairs off the ends after alignment.
@nitishnarula One way to find out more about what a clustered locus looks like wihtin a sample -- which might tell you more about whether more stringent filtering is needed -- is to look at the clust files themselves. You can find these in the "clust" folder in the files that end with .clustS.gz. For example, you can grep the lines that have the invariable part of the sequence from locus 192 with the command below, which will also show a few lines before and after the sequence:
deren@oud:~/Documents/ipyrad/tests$ zcat cli_clust_0.85/1A_0.clustS.gz | grep CACTTAGGCCGAATCGCCATCCGTTT -B2 -A2
//
0129ce420d1a253f505c6607939c0b82;size=17;*
TGCAGTAAAGCTCTCTACATAGTAATTCGAACGCGACGGTCTTCGGCAGCTCGCGGTCCGATCCCCACTTAGGCCGAATCGCCATCCGTTT
030e1748949cf1b2641e1c25e101fa59;size=1;+
TGCAGTAAGGCTCTCTACATAGTAATTCGAACGCGACGGTCTTCGGCAGCTCGCGGTCCGATCCCCACTTAGGCCGAATCGCCATCCGTTT
039ddae2d3c987d32c568f2e11e8643b;size=1;+
TGCAGTAAAGCTCTCTACATAGTAATTCGAACGCGACGGTCTTCGGCAGCTCGCGGTCCGAGCCCCACTTAGGCCGAATCGCCATCCGTTT
48b153b8418c7efdb82e84e8404a20f7;size=1;+
TGCAGTAAAGGTCTCTACATAGTAATTCGAACGCGACGGTCTTCGGCAGCTCGCGGTCCGATCCCCACTTAGGCCGAATCGCCATCCGTTT
8aea96d03714898900d679c3d47e2489;size=1;+
TGCAGTAAAGCTCTCTACATAGTAATTCGAACGCGACGGTCTTCGGCAGCTCGCGGTCCGATCCCCACTTAGGCCGAATCGCCATACGTTT
fc3561180e13b466d9cb4475018f9fa0;size=1;+
TGCAGTAAAGCTCTCTACATAGTAATTCGGACGCGACGGTCTTCGGCAGCTCGCGGTCCGATCCCCACTTAGGCCGAATCGCCATCCGTTT
//
//
Deren Eaton
@dereneaton
Jul 17 2017 16:56
The names of the sequences are hashed, so they're not meaningful. But you can see that the first sequence occurred 17 times, and then there were a number of singletons that also matched in this cluster. They all start with the restriction overhang TGCAG. If there's something weird going on at the beginning or end of your sequences and this will help to find that. Depending on your datatype (e.g., RAD, ddRAD, 2bRAD) there are different ways of trimming the ends that may be needed.
Isaac Overcast
@isaacovercast
Jul 17 2017 17:56
@nematiz Did you try killing it and rerunning with the -f flag?
Bohao Fang
@fangbohao_twitter
Jul 17 2017 20:20
Hi @dereneaton ! Can I select (filter) certain chromosomes by ipyrad?
Deren Eaton
@dereneaton
Jul 17 2017 20:45
Hi @fangbohao_twitter, not super easily that I can think of. What do you have in mind?
Bohao Fang
@fangbohao_twitter
Jul 17 2017 20:49
Hi @dereneaton, I would like to remove sex chromosomes directly by ipyrad. Because in this way, I can get other format outfiles at once instead of playing VCF file around with other tools.
Isaac Overcast
@isaacovercast
Jul 17 2017 20:50
@fangbohao_twitter There is no straightforward way to do this internal to ipyrad. vcftools is actually really good at manipulating vcf files, so try that.
Deren Eaton
@dereneaton
Jul 17 2017 20:51
If you planned to do this from the beginning you could edit your reference genome file so that it does not include the sex chromosomes, and then only reads which match the non-sex chromosomes will be included.
You could still do that now, you would just need to create a new branch with the edited reference file and restart from step 3 with force=True.
and you could do the opposite (only include sex chromosomes in the reference) if you wanted to get only the RAD data matching to sex chromosomes in your output.
We'll keep that in mind as something to potentially make as a filtering tool in the future, though. But it probably won't happen very soon.
Bohao Fang
@fangbohao_twitter
Jul 17 2017 20:55
@dereneaton thanks for your explanation! I suppose it is easier to use the vcftools now.