These are chat archives for dereneaton/ipyrad

21st
Mar 2017
Bohao Fang
@fangbohao_twitter
Mar 21 2017 09:41
@isaacovercast Yes, I used 'denovo+reference'. And I created a branch for steps 6-7 to run the new version of ipyrad and get the CHROM positions.
Isaac Overcast
@isaacovercast
Mar 21 2017 12:15
@fangbohao_twitter If you ran an older version of ipyrad to generate the consensus base calls (step 5), then the CHROM/POS information will already have been thrown out. You need to re-run step 5, as well as steps 6 & 7, to get the CHROM/POS information in the output.
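A minimal CLI sketch of that re-run, assuming a params file named params-data.txt and 32 cores (both placeholders; -p, -b, -s, -c, and -f are standard ipyrad options):

# branch so the original assembly is preserved ("data_chrompos" is a placeholder name)
ipyrad -p params-data.txt -b data_chrompos
# re-run steps 5, 6 and 7 on the new branch with the newer ipyrad version (-f forces re-running finished steps)
ipyrad -p params-data_chrompos.txt -s 567 -c 32 -f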
Emily Warschefsky
@ewarschefsky_twitter
Mar 21 2017 14:03
@isaacovercast: the 0.95 threshold took a while to run step 3 (~12 days I think?). I did an ls -ltr on the whatever_clust_0.9 folder - that is how I figured out that at the 0.90 threshold it ran for 6 days, writing and updating the htemp, utemp, clust.gz, and clustS.gz files as usual. However, it then continued running for 5 more days (before I got cut off because I ran out of cluster time) without writing to any of the files in that folder for the whole 5-day period.
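For reference, a sketch of that check (the directory name is a placeholder for the actual *_clust_0.9 output folder):

# long listing sorted by modification time, oldest first, to see when each
# clustering file (htemp, utemp, clust.gz, clustS.gz) was last written
ls -ltr my_assembly_clust_0.9/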
Deren Eaton
@dereneaton
Mar 21 2017 14:13
@ewarschefsky_twitter that is a long time for step 3 to run, even if you had hundreds of samples. Here are some ideas:
(1) are you using a single node, or multiple nodes with MPI? If the MPI did not initialize correctly (e.g., if you did not load the MPI module on your system), then it may be trying to run 32 jobs at a time on a single core instead of 32 jobs on 32 cores, and that will make everything run waaaay slower (a submission-script sketch follows this list).
(2) Did you aggressively trim/filter your data? How many clusters were you getting per sample in the .90 clustering? For good clean data you might expect a few tens or hundreds of thousands of clusters per sample. If you get several million, that is less ideal, since they will be almost all low-depth clusters. If the data have very low quality 3' ends, or have adapters in them, then they will not de-replicate or cluster efficiently: instead of collapsing identical reads before clustering (which speeds and improves the process dramatically), you end up clustering millions of reads that are unique only because of errors.
(3) Clustering of paired-end data, either pairddrad or pairgbs, takes longer than for single-end data. But this still seems like a long time for your data to run, and I would guess that one of the first two points above is slowing you down, since I've always seen clustering finish in much less than 12 days.
(4) The fact that your job seems to have stalled at 56% and you see no files being written suggests to me that your job has failed for some reason. It is rare for a job to stop while clustering, so I'm not sure what could have happened. When you submit the job to your cluster, do you request the entire node (usually via some kind of argument like "exclusive", as in the sketch below)? That can help ensure no one else connects to your node and crashes it with a high-memory job or something.
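A sketch of a SLURM submission covering points (1) and (4); the module name, node counts, and params file name are assumptions about this particular cluster:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --exclusive       # request whole nodes so nothing else lands on them (point 4)
module load OpenMPI       # load the MPI module before launching (point 1); module name varies by cluster
ipyrad -p params-data.txt -s 3 -c 32 --MPI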
elviscat
@elviscat
Mar 21 2017 15:14
@dereneaton got it, many thanks!!
Emily Warschefsky
@ewarschefsky_twitter
Mar 21 2017 15:15

@dereneaton - thanks for the reply - regarding your thoughts:
(1) I am using multiple nodes with MPI, but it did initialize correctly (I double checked with my HPC administrator)
(2) I'm not sure what qualifies as "aggressively" trimming/filtering my data. At 0.95, the first column of the step 3 stats output says the samples had between 301,057 and 2,810,478 clusters (the latter a huge outlier - most had <1,000,000). I pasted the params for the 0.95 run below (I branched after step 2 to run step 3 at different clustering levels - a sketch of that branching follows point 4) - let me know if you think any of these should be modified. I did check the reads with FastQC and they looked fine.

denovo                         ## [5] [assembly_method]: Assembly method (denovo, reference, denovo+reference, denovo-reference)
                               ## [6] [reference_sequence]: Location of reference sequence file
ddrad                          ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
CATG, AATT                     ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
5                              ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33                             ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6                              ## [11] [mindepth_statistical]: Min depth for statistical base calling
6                              ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
1000                           ## [13] [maxdepth]: Max cluster depth within samples
0.95                           ## [14] [clust_threshold]: Clustering threshold for de novo assembly
1                              ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
1                              ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35                             ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2                              ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
5, 5                           ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus (R1, R2)
8, 8                           ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus (R1, R2)
9                              ## [21] [min_samples_locus]: Min # samples per locus for output
20, 20                         ## [22] [max_SNPs_locus]: Max # SNPs per locus (R1, R2)
8, 8                           ## [23] [max_Indels_locus]: Max # of indels per locus (R1, R2)
0.5                            ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus (R1, R2)
0, 0                           ## [25] [edit_cutsites]: Edit cut-sites (R1, R2) (see docs)
0, 0, 0, 0                     ## [26] [trim_overhang]: Trim overhang (see docs) (R1>, <R1, R2>, <R2)

(3) This is just SE ddrad data right now
(4) I was submitting to the new high memory nodes our cluster just installed, so memory shouldn't be a problem, but I guess it could be.
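A sketch of the branching workflow mentioned in point (2), with placeholder assembly and file names:

# branch the step-2 assembly into one new assembly per clustering threshold
ipyrad -p params-data.txt -b clust90
ipyrad -p params-data.txt -b clust95
# edit clust_threshold ([14]) to 0.90 and 0.95 in the new params files, then run step 3 on each
ipyrad -p params-clust90.txt -s 3 -c 32
ipyrad -p params-clust95.txt -s 3 -c 32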

On another note - I now have a draft genome sequence I could align to - I imagine that this would greatly reduce clustering time compared to the denovo assembly? However, since I'm doing a genus-level phylogeny, would this bias the clustering toward the species that we have the genome sequence for?
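For context, a sketch of what switching a branch to use the draft genome would look like (names and paths are placeholders, and the choice between 'reference' and 'denovo+reference' is a separate decision):

# branch before changing the assembly method
ipyrad -p params-data.txt -b refmap
# in params-refmap.txt, set [5] assembly_method to denovo+reference (or reference)
# and point [6] reference_sequence at the draft genome fasta; then re-run from step 3
ipyrad -p params-refmap.txt -s 34567 -c 32 -f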