These are chat archives for dereneaton/ipyrad

27th
Jul 2017
Wind-ant
@Wind-ant
Jul 27 2017 06:21
Hi @dereneaton, @isaacovercast, I have two questions.
First: in step 3 of pyrad/ipyrad, the .clustS.gz file for each sample contains many clusters of reads grouped by similarity, and after the name of each read there is something like size=6;. I checked the tutorial and found nothing about this. Is the size the depth of that read? And what do the * and + mean?
Second: I have paired-end GBS data, but the numbers of loci and SNPs called by pyrad/ipyrad and by Stacks are very different. My raw Stacks result is on the order of 100,000 SNPs, but pyrad/ipyrad gives only about 1,000-10,000 SNPs no matter how I adjust the parameters. I have read many papers on GBS analyzed with pyrad and Stacks, and most pyrad GBS analyses report a similar 1,000-10,000 SNP range. I suspect the difference comes from the SNP- and locus-calling strategies: perhaps pyrad treats heterozygous sites within a sample as N and only calls SNPs at sites that vary across samples, while Stacks keeps those within-sample heterozygous sites, and so reports more SNPs.
I'm not a native English speaker, so I'm not sure whether my questions are clear. Thank you very much for taking the time to discuss this.
Deren Eaton
@dereneaton
Jul 27 2017 11:01
Hi @Wind-ant, have you run the empirical (Pedicularis) tutorial that is in the ipyrad documentation? It has only 13 samples, so I would consider it a small data set, yet it still has >70,000 parsimony-informative SNPs, and >170,000 SNPs total. So there is nothing inherent to ipyrad that limits the number of SNPs you recover. I'm guessing instead that it has something to do with the parameter settings you are using.
In the clust files the annotation size=6 records that the exact sequence was observed 6 times in the data. It is just more compact than writing the same sequence six times. The * records which sequence was the seed of a cluster, meaning other reads were clustered to it, and the - or + characters record whether sequences matched to it in a forward or reverse-complement direction.
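As a rough illustration of that annotation scheme (a sketch only, not ipyrad's internal parser, and the exact header layout may vary between versions), a cluster read name like >read1;size=6;* could be parsed like this:

```python
import re

def parse_clust_header(header):
    """Split a cluster read header such as '>read1;size=6;*' into
    (name, depth, flag).  'size=' is the number of times the exact
    sequence was observed; '*' marks the cluster seed; '+'/'-' mark
    forward vs reverse-complement matches to the seed.
    Illustrative sketch only -- not ipyrad's own code."""
    m = re.match(r">(?P<name>[^;]+);size=(?P<size>\d+);(?P<flag>[*+-]?)", header)
    if not m:
        raise ValueError("unrecognized header: %r" % header)
    return m.group("name"), int(m.group("size")), m.group("flag")
```

Summing the size= values over all reads in a cluster gives the total depth of that cluster.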
If you want you could tell us about your type of data and the parameter settings you are using and we could provide some advice on whether anything should be changed.
Wind-ant
@Wind-ant
Jul 27 2017 14:11

I have paired-end GBS data for 76 individuals (73 ingroup, 3 outgroup) and no reference. My parameters are as follows:
------- ipyrad params file (v.0.7.2)--------------------------------------------
c84m40 ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
./ ## [1] [project_dir]: Project dir (made in curdir if not present)

                           ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
                           ## [3] [barcodes_path]: Location of barcodes file

assembled_filter_fastq/*.fq ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo ## [5] [assembly_method]: Assembly method (denovo, reference, denovo+reference, denovo-reference)

                           ## [6] [reference_sequence]: Location of reference sequence file

merged ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
TGCAG, TGCAG ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
4 ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33 ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6 ## [11] [mindepth_statistical]: Min depth for statistical base calling
6 ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000 ## [13] [maxdepth]: Max cluster depth within samples
0.84 ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0 ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
0 ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
50 ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2 ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
5, 5 ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus (R1, R2)
8, 8 ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus (R1, R2)
40 ## [21] [min_samples_locus]: Min # samples per locus for output
20, 20 ## [22] [max_SNPs_locus]: Max # SNPs per locus (R1, R2)
8, 8 ## [23] [max_Indels_locus]: Max # of indels per locus (R1, R2)
3 ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus (R1, R2)
0, 0, 0, 0 ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0 ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
p, k, n, s, v ## [27] [output_formats]: Output formats (see docs)

                           ## [28] [pop_assign_file]: Path to population assignment file

I used PEAR to merge the forward and reverse raw reads, then used the FASTX-Toolkit to trim the reads to 110 bp and filter low-quality bases, and then used the merged fastq files as input for ipyrad. I actually skipped step 1 in ipyrad because I have no barcodes file (the company had already cleaned the data for me), so I ran ipyrad from step 2 through step 7 with the parameters above. I tried adjusting the clustering threshold (0.8, 0.9) and the min-taxon coverage (30, 40, 50), but it didn't work much better: with taxon coverage set to 40 the number of SNPs is about 10,000, and with 50 maybe 2,000 (too few for downstream analysis). I don't know what's wrong with my parameters or my data.

Wind-ant
@Wind-ant
Jul 27 2017 14:35
If you need me to show more results about some steps for checking the problems,just tell me please.
Isaac Overcast
@isaacovercast
Jul 27 2017 17:23
@Wind-ant If you look in your output directory there is a stats file that will show you why loci are getting filtered. Here's an example from a dataset I'm working with now:
total_prefiltered_loci             126789              0         126789
filtered_by_rm_duplicates            1947           1947         124842
filtered_by_max_indels              12213          12213         112629
filtered_by_max_snps                 6603            933         111696
filtered_by_max_shared_het            278            124         111572
filtered_by_min_sample              16079          16016          95556
filtered_by_max_alleles              2104            963          94593
total_filtered_loci                 94593              0          94593
Here most of my loci are getting filtered by max indels and min sample per locus. I'm willing to bet your max_shared_Hs_locus setting is way too low. So you merged your PE data and you're running it as SE? Also, your setting for datatype looks weird:
merged ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
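To make it easy to spot the worst offenders, a table like the one above can be tallied with a few lines of Python (a quick sketch assuming the three-column whitespace-separated layout shown; this is not an ipyrad utility):

```python
# Filter-stats text copied from the example above: name, total loci
# matching the filter, loci actually removed by it, loci retained.
STATS = """\
total_prefiltered_loci             126789              0         126789
filtered_by_rm_duplicates            1947           1947         124842
filtered_by_max_indels              12213          12213         112629
filtered_by_max_snps                 6603            933         111696
filtered_by_max_shared_het            278            124         111572
filtered_by_min_sample              16079          16016          95556
filtered_by_max_alleles              2104            963          94593
total_filtered_loci                 94593              0          94593
"""

def loci_removed(stats_text):
    """Return {filter_name: loci removed}, using the middle column
    (loci removed by this filter after earlier filters applied)."""
    removed = {}
    for line in stats_text.strip().splitlines():
        name, total, applied, retained = line.split()
        if name.startswith("filtered_by"):
            removed[name] = int(applied)
    return removed

removed = loci_removed(STATS)
worst = max(removed, key=removed.get)  # here: filtered_by_min_sample
```

In this example min_sample and max_indels account for most of the losses, which is what motivates checking max_shared_Hs_locus and min_samples_locus first.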
Isaac Overcast
@isaacovercast
Jul 27 2017 17:40
I would use GBS here.
Deren Eaton
@dereneaton
Jul 27 2017 18:03
@Wind-ant there are big differences in how ipyrad treats paired-gbs data versus how stacks does, and we believe the way we treat it is much better, which we can go into.
Katherine Silliman
@ksil91
Jul 27 2017 21:12

@isaacovercast @dereneaton Looking at the var and pis output in the s7 stats file, the numbers don't make sense to me. The var column should always be larger than the pis column (since it includes loci with both pis and autapomorphies), but my output looks like this (using v.0.6.19):

     var  sum_var   pis  sum_pis
0   1108        0  3541        0
1   2108     2108  4362     4362
2   2620     7348  3870    12102
3   2796    15736  3200    21702
4   2606    26160  2422    31390
5   2276    37540  1897    40875
6   1960    49300  1397    49257
7   1658    60906   957    55956
8   1357    71762   724    61748
9   1141    82031   561    66797
10   899    91021   391    70707

@nclowell and I want to use this information, along with the proportion of rm_duplicates, to compare runs across a range of clustering similarities from 70-98% in order to determine the best threshold for our two bivalve species (which have lots of repeats/high heterozygosity/are weird in general). So we'd like to plot the number of homozygous loci and the number of heterozygous loci on the Y-axis and clustering threshold on the X-axis and look at where they asymptote.
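A per-run summary for that kind of comparison can be pulled straight from the var distribution: the n=0 row counts invariant (homozygous) loci, and the rest are variable. Here's a small sketch using the first few rows of the table above (column meanings assumed from that table; not an ipyrad function):

```python
# Rows of the s7 stats distribution: (n, var, pis) where 'var' is the
# number of loci with exactly n variable sites and 'pis' the number
# with exactly n parsimony-informative sites.  Values copied from the
# first four rows of the table above.
ROWS = [
    (0, 1108, 3541),
    (1, 2108, 4362),
    (2, 2620, 3870),
    (3, 2796, 3200),
]

def summarize(rows):
    """Return (total loci, invariant loci, total variable sites)."""
    total_loci = sum(var for _, var, _ in rows)
    invariant = rows[0][1] if rows and rows[0][0] == 0 else 0
    total_var_sites = sum(n * var for n, var, _ in rows)
    return total_loci, invariant, total_var_sites
```

Note that the running sum_var column in the table is the cumulative sum of n*var (e.g. 1*2108 + 2*2620 + 3*2796 = 15736, matching the n=3 row), so total_var_sites reproduces it; repeating this per clustering threshold gives the Y-axis values for the plot.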

Isaac Overcast
@isaacovercast
Jul 27 2017 21:27
@all New Version: v.0.7.6: Fixes a nasty bug in SE reference assemblies (introduced in v.0.7.0). Update now plz.
Katherine Silliman
@ksil91
Jul 27 2017 21:29
@isaacovercast @dereneaton @nclowell Never mind, I understand how it is set up now. Both the var and pis columns sum to the total number of filtered loci.