These are chat archives for dereneaton/ipyrad

28th
Nov 2016
Deren Eaton
@dereneaton
Nov 28 2016 08:17
@/all v.0.5.8 is now available for Linux (and Mac soon). We've made some changes and done extensive testing to avoid memory limits, and I believe this should no longer be a problem. For example, I just assembled a 360-taxon data set of paired-end 150bp reads on my laptop. Let us know if you run into any problems. Cheers,
Edgardo M. Ortiz
@edgardomortiz
Nov 28 2016 08:32
@dereneaton that is amazing, I will test it with that 666-taxon data set that got stuck building the database in step 6!
R2C2.lab
@R2C2_Lab_twitter
Nov 28 2016 09:08
Hi @dereneaton, just a curiosity: how long did it take to run the 360-taxon data set on your laptop, and how much memory does it have? I ran a 5-sample data set on a cluster (64GB; 20 cores) and step 3 took 23 days ([####################] 100% clustering | 23 days, 16:10:00), while in step 6 the clustering step alone took 10 days ([####################] 100% clustering across | 10:18:26). Is it because I have 300bp PE reads?
ViviSette
@ViviSette
Nov 28 2016 09:15

@ViviSette or maybe you tried starting from step 3 directly; even if your fastqs are demultiplexed and cleaned, you have to start from step 1 in ipyrad. That happened to me :)

@edgardomortiz I did indeed try to start from step 3, so it's good to know that I have to run steps 1 and 2 anyway :) So the program can handle it if the path you give it for non-demultiplexed files actually contains only already-demultiplexed ones? I'll try that and see if it works, thanks! :)
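(Note for readers: for already-demultiplexed data the usual approach in ipyrad is the opposite of the above: leave fields [2] and [3] blank and point field [4] at the sorted files, so step 1 just loads and counts the reads instead of demultiplexing. The path below is a made-up example.)

                            ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
                            ## [3] [barcodes_path]: Location of barcodes file
./demux/*.fastq.gz          ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files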

@dereneaton Hi Deren, thanks a lot for your reply. Unfortunately we don't have administration rights on the servers we work with, so I need to do everything through an IT administrator; I guess he'll just have to be patient and update it for us every time we ask :)
I will try to follow @edgardomortiz's suggestion. I was actually trying to start directly from step 3, so there shouldn't be any assembly name created yet if step 1 is supposed to be the first step to run... I'll try and see, thanks a lot :)
ViviSette
@ViviSette
Nov 28 2016 10:45

Hello again... I'm sorry to write again, but I keep getting errors and I haven't managed to get past step 1 for a week now :(

If I run steps 1 and 2 now, step 1 runs fine and then I get a message, repeated several times, saying
found an error in step2; see ipyrad_log.txt
but the log file is empty so I can't see what went wrong... I've checked the params file over and over, but I can't find the problem.

These are the fields of my params file involved in step 2:
Lib3                        ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
./                          ## [1] [project_dir]: Project dir (made in curdir if not present)
./Crickets_Lib3_R*.fq       ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
pairddrad                   ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc
5                           ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
6                           ## [11] [mindepth_statistical]: Min depth for statistical base calling
2                           ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35                          ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
0, 0                        ## [25] [edit_cutsites]: Edit cut-sites (R1, R2) (see docs)
0, 0, 0, 0                  ## [26] [trim_overhang]: Trim overhang (see docs) (R1>, <R1, R2>, <R2)

Thanks again for the help

Edgardo M. Ortiz
@edgardomortiz
Nov 28 2016 12:24
Could you post your complete params file? Also, I would recommend putting the raw fastqs inside a subfolder in your assembly directory. If step 1 runs fine, that means you are getting a folder ending in _fastqs containing your demultiplexed samples; could you verify they are actually there? Is your barcodes file correct?
ViviSette
@ViviSette
Nov 28 2016 13:31

I am getting a folder ending in _fastqs which contains all the individuals, so I guess the barcodes file is correct. I am not sure what you mean by putting the raw fastqs inside a subfolder in the assembly directory? The only directory created is the one ending in _fastqs, and the fastq files are in there. Do you mean I should move the whole non-demultiplexed library?
Here is the full params file:
Lib3_Final                  ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
./                          ## [1] [project_dir]: Project dir (made in curdir if not present)
./Crickets_Lib3_R*_FINAL.fq ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
./barcodes_Lib3.txt         ## [3] [barcodes_path]: Location of barcodes file
                            ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo                      ## [5] [assembly_method]: Assembly method (denovo, reference, denovo+reference, denovo-reference)
                            ## [6] [reference_sequence]: Location of reference sequence file
pairddrad                   ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
                            ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2) TGCAG, CCT
5                           ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33                          ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6                           ## [11] [mindepth_statistical]: Min depth for statistical base calling
6                           ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000                       ## [13] [maxdepth]: Max cluster depth within samples
0.95                        ## [14] [clust_threshold]: Clustering threshold for de novo assembly
1                           ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
2                           ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35                          ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2                           ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
5, 5                        ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus (R1, R2)
8, 8                        ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus (R1, R2)
5                           ## [21] [min_samples_locus]: Min # samples per locus for output
10, 10                      ## [22] [max_SNPs_locus]: Max # SNPs per locus (R1, R2)
8, 8                        ## [23] [max_Indels_locus]: Max # of indels per locus (R1, R2)
0.6                         ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus (R1, R2)
0, 0                        ## [25] [edit_cutsites]: Edit cut-sites (R1, R2) (see docs)
0, 0, 0, 0                  ## [26] [trim_overhang]: Trim overhang (see docs) (R1>, <R1, R2>, <R2)
l, p, s, v                  ## [27] [output_formats]: Output formats (see docs)
./Popfile.txt               ## [28] [pop_assign_file]: Path to population assignment file

Edgardo M. Ortiz
@edgardomortiz
Nov 28 2016 14:10
@ViviSette Why didn't you specify the overhangs? Did you remove them in advance? I would still specify them to help with the adapter cleaning; otherwise I don't see anything strange. As for the non-demultiplexed fastq file location, it was just a suggestion to put it in a subfolder; I don't think it will affect anything, especially since your individuals were demultiplexed successfully.
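(Specifying the overhangs would just mean filling in field [8]; TGCAG and CCT below are example cutters taken from the comment text in the params file above, so substitute your own enzymes' overhangs.)

TGCAG, CCT                  ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)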
ViviSette
@ViviSette
Nov 28 2016 14:24
The adapters were removed by the sequencing company as a first cleaning step. I can add the overhangs and try again, but I doubt that step 2 will start working, since they are not a mandatory field... The log file keeps being empty, so I don't know what the problem is :(
I am wondering whether it is the usual "No ipcluster instance found. This may be a problem with your installation setup. I would recommend that you contact the ipyrad developers." error that I kept getting every time I tried to run some steps? Of course I don't know if that's the case, as I can't see it in the log; just guessing... Any idea how to proceed from here?
I don't know if it provides any useful information but the error
"found an error in step2; see ipyrad_log.txt"
is repeated 55 times on the screen, i.e. exactly once for each of my 55 demultiplexed samples...
Edgardo M. Ortiz
@edgardomortiz
Nov 28 2016 15:22
Oh I see, you must start your ipyrad run with something like this, @dereneaton helped me to set it up for our cluster.
ipcluster start --n=24 --profile=ipyrad --daemonize && sleep 60
ipyrad -p params.txt -s 1234567 --ipcluster
Change --n to your number of cores, and edit the -s flag of the ipyrad command to run whichever steps you want.
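(A side note for readers: when the run finishes, the cluster can be shut down with the matching ipcluster command; the profile name just has to match the one used at startup.)
ipcluster stop --profile=ipyrad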
Deren Eaton
@dereneaton
Nov 28 2016 15:30
@ViviSette, I would strongly recommend that you install ipyrad yourself on your cluster following the instructions in the documentation (http://ipyrad.readthedocs.io). You do not need administrator privileges to install it this way, and doing so ensures that all of the packages ipyrad relies upon are available to you. After that, try running one of the tutorial data sets explained in the documentation to learn more about the parameter settings, and to test whether the problem is with your settings and data or whether there is still a problem with the installation.
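(For readers of the archive, the documented no-admin install route is roughly the following; this is a sketch assuming 64-bit Linux and a bash shell, so check http://ipyrad.readthedocs.io for the exact current commands.)

# install Miniconda into your home directory; no admin rights needed
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh
# then install ipyrad from its conda channel
conda install -c ipyrad ipyrad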
Deren Eaton
@dereneaton
Nov 28 2016 15:56
@R2C2_Lab_twitter it took about 16 hours on a 4-core laptop with 16GB RAM for steps 1-7. The data set only had 10K loci, so in that dimension it was smaller than many data sets, but nloci is not the factor that we expect to cause memory limits. The speed of step3 will depend heavily on a number of factors. In this case the step3-clustering only took about 30 minutes, while step3-aligning took about 10 hours. This is because aligning sequences for 360 taxa obviously takes much longer. The aligning step is super easy to parallelize, though, so had I run it on a 40-core cluster it would have only taken ~1 hour.
If your data set has millions of unique fragments that do not cluster together then step3-clustering will take much longer, but step3-aligning will probably be super fast, since there will be little to align.
In general, you hope to have fewer clusters that have many reads in them, as opposed to millions of clusters at very low depth.
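(The 40-core estimate above is simple core-hours arithmetic: ~10 hours x 4 cores = ~40 core-hours of aligning work, and 40 core-hours spread over 40 cores is about 1 hour, assuming near-perfect parallel scaling.)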
Deren Eaton
@dereneaton
Nov 28 2016 16:07
@R2C2_Lab_twitter But the latter type happens quite frequently. There should still be something we can do to make your analysis run faster. I'm surprised step 3 could take >20 days; I expected it would take at most 1-2 days for just about any data set. I haven't tested with 300bp reads, and longer reads will certainly slow things down, though I hadn't expected them to be that much slower than 150bp reads. When you say 300bp paired-end, do you mean each read is 300bp, or each read is 150? Did you filter the data for adapters in step 2?
R2C2.lab
@R2C2_Lab_twitter
Nov 28 2016 16:12
Each read is 300 bp and I filtered for adapters in step 2
2 ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
Deren Eaton
@dereneaton
Nov 28 2016 16:14
What kind of stats did you get for step3, like nclusters/sample and avg. depth?
R2C2.lab
@R2C2_Lab_twitter
Nov 28 2016 16:17

regina@ccmar-r2c2-01:~/ipyrad_individual> ipyrad -p params-individual.txt -r

Summary stats of Assembly individual

                        state  reads_raw  reads_passed_filter  reads_merged  clusters_total  clusters_hidepth
MC-RB-1-RPB_val_seqtk       3    2620553              2618605       2618605         1443619            289448
MC-RB-2-RB_val_seqtk        3    2834954              2832907       2832907         1608037            309580
MC-RB-3-YOP_val_seqtk       3    3322935              3320462       3320462         1799845            372295
MC-RB-4-YOWB_val_seqtk      3    3501355              3499050       3499050         1975561            389646
MC-RB-5-YOB_val_seqtk       3    3162933              3160847       3160847         1672337            361984

Full stats files

step 1: None
step 2: ./individual_edits/s2_rawedit_stats.txt
step 3: ./individual_clust_0.85/s3_cluster_stats.txt
step 4: None
step 5: None
step 6: None
step 7: None

Deren Eaton
@dereneaton
Nov 28 2016 16:20
oh, nice, it looks like you got a ton of data!
Deren Eaton
@dereneaton
Nov 28 2016 16:27
@R2C2_Lab_twitter There is not much that can be done to make the clustering run faster; that step is done using the vsearch software, which is pretty much the fastest thing around. The clustering step is now parallelized a bit more efficiently than it was a few versions back, so that if you submit 5 samples to be clustered on 20 cores, each sample will be allotted 4 threads in vsearch. That should speed things up versus when each sample was allotted 1 thread. But no matter what, if you have 2.6M reads that cluster into 1.4M clusters, then most of your reads are probably occurring as singletons. This means that your data set, despite having tons of reads, is actually under-sequenced, because the low-depth clusters should be discarded (in my opinion, though you could use a very low minimum depth setting), and therefore most of the missing data in your final alignment will probably be due to low coverage rather than mutation-dropout. Is your datatype set to pairddrad or pairgbs? The pairgbs method is also slower because it has to test for reverse-complement matches during clustering.
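(To make the threading concrete: the within-sample clustering is one vsearch run per sample, and a hand-run equivalent would look roughly like the command below. The file names are placeholders and the exact flags ipyrad passes may differ.)

# cluster one sample's dereplicated reads at 85% identity on 4 threads;
# --strand both also checks reverse complements, which is what makes pairgbs slower
vsearch --cluster_smallmem sample_derep.fasta --usersort \
        --id 0.85 --threads 4 --strand both --uc sample.uc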
R2C2.lab
@R2C2_Lab_twitter
Nov 28 2016 16:30
pairgbs; and when you say that I could use a low minimum depth setting, do you mean ## [12] [mindepth_majrule] < 2? I’m using 2
Deren Eaton
@dereneaton
Nov 28 2016 17:04
Oh, then you are already using a low value.
"Low" being less than 5, since below that you don't really have the statistical power to distinguish sequencing errors from heterozygotes.
But it probably makes sense to use a low mindepth setting for your data set since the avg depth is very low.
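(In params-file terms, this discussion concerns the two fields below; 6 and 2 are just example values, with the majority-rule depth allowed to be lower than the statistical one.)

6                           ## [11] [mindepth_statistical]: Min depth for statistical base calling
2                           ## [12] [mindepth_majrule]: Min depth for majority-rule base calling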
R2C2.lab
@R2C2_Lab_twitter
Nov 28 2016 19:05
@dereneaton that’s what I thought; so, I just need to be patient :-) thanks for all the explanations
Isaac Overcast
@isaacovercast
Nov 28 2016 19:16
@R2C2_Lab_twitter Yeah, James had a very similar issue with his PE ezRAD data (2 x 300bp PE). Long, long, long runtimes. @Cycadales_twitter might weigh in with some tips for dealing with this. One thing I would double-check is the quality of R2 with something like FastQC. For the 300bp PE data sets I've seen, the tail of R2 accumulates a ton of errors, so this could be causing sequences to erroneously fail to cluster.
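(A quick way to run that check; the file names are placeholders, and -o just names an existing output directory.)
mkdir -p fastqc_reports
fastqc sample_R1.fastq.gz sample_R2.fastq.gz -o fastqc_reports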
R2C2.lab
@R2C2_Lab_twitter
Nov 28 2016 19:20
I checked R1 and R2 in FastQC and trimmed the first 6bp and the last 30bp from both reads with seqtk to get higher quality
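(For reference, that kind of fixed-length trimming with seqtk looks roughly like this: -b drops bases from the start of each read and -e from the end; file names are placeholders.)
seqtk trimfq -b 6 -e 30 raw_R1.fq > trimmed_R1.fq
seqtk trimfq -b 6 -e 30 raw_R2.fq > trimmed_R2.fq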
Isaac Overcast
@isaacovercast
Nov 28 2016 21:46
@R2C2_Lab_twitter did you ever mention what clustering threshold you're using?
R2C2.lab
@R2C2_Lab_twitter
Nov 28 2016 22:02
I'm using 0.85