whatever_clust_0.9 folder - that's how I figured it out: at the 0.90 threshold, the run spent 6 days writing and updating the htemp, utemp, clust.gz, and clustS.gz files as usual, but then kept running for 5 more days (until I got cut off because I ran out of cluster time) without writing to any of the files in that folder for that whole period.
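In case it's useful to anyone hitting the same thing, this is roughly how I checked for the stall - a minimal sketch that prints when each file in the step-3 clust folder was last touched (the folder name is a placeholder from my run; adjust the path):

```python
import os, time

clust_dir = "whatever_clust_0.9"   # placeholder - the *_clust_0.9 folder from step 3
for name in sorted(os.listdir(clust_dir)):
    path = os.path.join(clust_dir, name)
    age_hr = (time.time() - os.path.getmtime(path)) / 3600
    print(f"{name}: last modified {age_hr:.1f} h ago")
```

If none of the htemp/utemp/clust files have been touched for days while the job is still "running", that's the stall.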
@dereneaton - thanks for the reply - regarding your thoughts:
(1) I am using multiple nodes with MPI, but it did initialize correctly - I double-checked with my HPC administrator (see the quick check I pasted after this list)
(2) I'm not sure what qualifies as "aggressively" trimming/filtering my data. At 0.95, the first column of the step 3 stats output shows the samples had between 301,057 and 2,810,478 clusters (the high value is a huge outlier - most samples had <1,000,000). I pasted the params for the 0.95 run below (I branched after step 2 to run step 3 at different clustering thresholds) - let me know if you think any of these should be modified. I did check the reads with FastQC and they looked fine.
denovo        ## [assembly_method]: Assembly method (denovo, reference, denovo+reference, denovo-reference)
              ## [reference_sequence]: Location of reference sequence file
ddrad         ## [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
CATG, AATT    ## [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
5             ## [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33            ## [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6             ## [mindepth_statistical]: Min depth for statistical base calling
6             ## [mindepth_majrule]: Min depth for majority-rule base calling
1000          ## [maxdepth]: Max cluster depth within samples
0.95          ## [clust_threshold]: Clustering threshold for de novo assembly
1             ## [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
1             ## [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35            ## [filter_min_trim_len]: Min length of reads after adapter trim
2             ## [max_alleles_consens]: Max alleles per site in consensus sequences
5, 5          ## [max_Ns_consens]: Max N's (uncalled bases) in consensus (R1, R2)
8, 8          ## [max_Hs_consens]: Max Hs (heterozygotes) in consensus (R1, R2)
9             ## [min_samples_locus]: Min # samples per locus for output
20, 20        ## [max_SNPs_locus]: Max # SNPs per locus (R1, R2)
8, 8          ## [max_Indels_locus]: Max # of indels per locus (R1, R2)
0.5           ## [max_shared_Hs_locus]: Max # heterozygous sites per locus (R1, R2)
0, 0          ## [edit_cutsites]: Edit cut-sites (R1, R2) (see docs)
0, 0, 0, 0    ## [trim_overhang]: Trim overhang (see docs) (R1>, <R1, R2>, <R2)
(3) This is just single-end (SE) ddRAD data right now
(4) I was submitting to the new high-memory nodes our cluster just installed, so memory shouldn't be a problem, but I guess it could be.
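Re (1), here's the quick check I mentioned - a minimal sketch using ipyparallel (which ipyrad runs on) to confirm the MPI engines actually connected across nodes rather than piling up on the head node. It assumes an ipcluster instance is already running, e.g. one started with MPI engines the way ipyrad's --MPI mode does:

```python
import socket
import ipyparallel as ipp

rc = ipp.Client()                          # connect to the running ipcluster
print(len(rc.ids), "engines connected")    # should match the cores requested across nodes

# confirm engines are spread over multiple hosts, not just the head node
hosts = rc[:].apply_sync(socket.gethostname)
for host in sorted(set(hosts)):
    print(host, "-", hosts.count(host), "engines")
```

In my case this reported the expected engine count on all of the nodes I requested, which is why I don't think MPI initialization is the problem.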
On another note - I now have a draft genome sequence I could align to - I imagine this would greatly reduce clustering time compared to the de novo assembly? However, since I'm doing a genus-level phylogeny, would this bias the assembly toward the species the genome sequence comes from?
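If it helps to see what I have in mind, this is roughly how I'd set that up - a sketch using the ipyrad Python API, branching so the de novo runs stay untouched (the JSON and fasta paths are placeholders from my setup):

```python
import ipyrad as ip

# a sketch, assuming the assembly was saved after step 2
data = ip.load_json("mydata.json")

# branch so the existing denovo runs are preserved, then point at the draft genome
ref = data.branch("ref_assembly")
ref.set_params("assembly_method", "reference")
ref.set_params("reference_sequence", "./draft_genome.fasta")

# rerun step 3 (reference mapping instead of denovo clustering)
ref.run("3")
```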