These are chat archives for dereneaton/ipyrad

28th
Mar 2018
Paolo Momigliano
@PaoloMomigliano_twitter
Mar 28 2018 08:19
@isaacovercast Thanks very much for your reply. Indeed, we are getting many more loci from what I can see (I am at step 5 of the pipeline right now). Yet it does seem that step 6 is now very slow, and the speed does not seem to scale appropriately with more cores (8 vs 16 vs 24 on a single node). Also, I was wondering whether step 6 (vsearch) can be spread effectively across multiple nodes, as until now I have not been able to do so. I can easily run steps 1-5 (reference assembly) effectively on multiple nodes, but step 6 is a real bottleneck here. These are RAD data made with PstI on a 0.5 Gb genome, so there are hundreds of thousands of loci: with 142 individuals (and about 400,000 clusters in step 3), step 6 will take about 2 weeks using 16 cores. Is this usual, and is there any way to speed up step 6? Thanks so much for the help!
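One general reason a step can stop scaling with more cores is Amdahl's law: if only part of the work runs in parallel, the serial remainder caps the speedup. The sketch below only illustrates that formula. The parallel fraction used is a hypothetical placeholder, not a measured property of ipyrad step 6 or vsearch.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup on `cores` cores when only `parallel_fraction`
    of the total work can run in parallel (Amdahl's law)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# ASSUMPTION: hypothetical value chosen for illustration only.
# It is NOT a measurement of ipyrad step 6.
ASSUMED_PARALLEL_FRACTION = 0.5

for cores in (8, 16, 24):
    speedup = amdahl_speedup(ASSUMED_PARALLEL_FRACTION, cores)
    print(f"{cores} cores: ideal speedup {speedup:.2f}x")
```

With any parallel fraction below 1, the gain from 8 to 16 to 24 cores shrinks at each step, which would look like the flat scaling described above. Whether this is what actually limits step 6 would need to be checked against real timings.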
Ollie White
@Ollie_W_White_twitter
Mar 28 2018 10:16

Thanks for your reply @isaacovercast. Yeah, I'm not quite sure what I've managed to do wrong here, but I appreciate your thoughts on this. See below the ls -l of the working directory, the fastqs directory, and the params file.

Working directory:

ls -l
total 4822272
-rw-r--r-- 1 oww1v14 bj 2418642962 Feb 16 17:05 MAC_RAW_GBS00289_L7_R1_data.fq.gz
-rw-r--r-- 1 oww1v14 bj 2518933734 Feb 16 17:08 MAC_RAW_GBS00289_L7_R2_data.fg.gz
-rw-r--r-- 1 oww1v14 bj        108 Feb 21 09:25 MD5.txt
-rw------- 1 oww1v14 bj        468 Feb 21 09:47 barcodes.txt
-rw------- 1 oww1v14 bj       4411 Mar 21 14:48 des.json
drwx------ 2 oww1v14 bj       4096 Mar 21 14:47 des_fastqs
-rw------- 1 oww1v14 bj       4910 Mar 21 14:48 ipyrad_log.txt
-rw------- 1 oww1v14 bj          0 Mar 21 14:44 job-ipyrad-12.e5137251
-rw------- 1 oww1v14 bj      49791 Mar 21 14:48 job-ipyrad-12.o5137251
-rw------- 1 oww1v14 bj       3035 Feb 22 11:49 params-des.txt
drwx------ 3 oww1v14 bj       4096 Feb 22 11:52 reference-plastid-genomes
-rw------- 1 oww1v14 bj        430 Mar 21 12:59 script-ipyrad.pbs
-rw-r--r-- 1 oww1v14 bj       1050 Feb 21 09:25 seq-info.txt

The des_fastqs directory is empty:

ls -l des_fastqs/
total 0
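A quick way to sanity-check which files a raw_fastq_path wildcard would pick up is to match the pattern against the directory listing. The sketch below uses file names copied from the listing above and the pattern from the params file in this message. The list of "expected" suffixes is an assumption made for illustration, and nothing here establishes whether ipyrad itself cares about the suffix.

```python
from fnmatch import fnmatchcase

# File names copied from the `ls -l` listing in this message.
listed_files = [
    "MAC_RAW_GBS00289_L7_R1_data.fq.gz",
    "MAC_RAW_GBS00289_L7_R2_data.fg.gz",
    "MD5.txt",
    "barcodes.txt",
]

# Pattern from the raw_fastq_path entry (leading "./" dropped).
pattern = "MAC_RAW_GBS00289_L7_R*"

# ASSUMPTION: suffixes treated here as conventional for gzipped FASTQ.
# This list is illustrative, not taken from the ipyrad documentation.
expected_suffixes = (".fq.gz", ".fastq.gz")

# Files the wildcard would select.
matched = [name for name in listed_files if fnmatchcase(name, pattern)]

# Selected files whose suffix is not in the assumed list.
unexpected = [name for name in matched if not name.endswith(expected_suffixes)]

print("matched by pattern:", matched)
print("unexpected suffix: ", unexpected)
```

The wildcard matches both read files, and the R2 file is flagged only because its name ends in .fg.gz rather than .fq.gz. That may or may not be related to the empty output directory.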
And finally the params file:
cat params-des.txt
------- ipyrad params file (v.0.7.21)-------------------------------------------
des                            ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
./                             ## [1] [project_dir]: Project dir (made in curdir if not present)
./MAC_RAW_GBS00289_L7_R*       ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
./barcodes.txt                 ## [3] [barcodes_path]: Location of barcodes file
                               ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo-reference               ## [5] [assembly_method]: Assembly method (denovo, reference, denovo+reference, denovo-reference)
./reference-plastid-genomes/reference-plastid-genome.fasta ## [6] [reference_sequence]: Location of reference sequence file
pairddrad                      ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
TA, GCGC                       ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
5                              ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33                             ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6                              ## [11] [mindepth_statistical]: Min depth for statistical base calling
6                              ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000                          ## [13] [maxdepth]: Max cluster depth within samples
0.85                           ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0                              ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
2                              ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35                             ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2                              ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
5                              ## [19] [max_Ns_consens]: Max Ns (uncalled bases) in consensus (R1, R2)
8                              ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus (R1, R2)
4                              ## [21] [min_samples_locus]: Min # samples per locus for output
20                             ## [22] [max_SNPs_locus]: Max # SNPs per locus (R1, R2)
8                              ## [23] [max_Indels_locus]: Max # of indels per locus (R1, R2)
0.2                            ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus (R1, R2)
0, 0, 0, 0                     ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0                     ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
*                              ## [27] [output_formats]: Output formats (see docs)
                               ## [28] [pop_assign_file]: Path to population assignment file
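When comparing params files between runs, it can help to read them into a dictionary. The sketch below is a hypothetical stand-alone parser, not part of the ipyrad API. It assumes every parameter line has the `value ## [index] [name]: description` layout seen in the file above, and it is run here on a few lines copied from that file.

```python
def parse_params(text):
    """Map parameter names to raw string values.

    ASSUMPTION: each parameter line looks like
    'value  ## [index] [name]: description'. Lines without '##'
    (such as the header) are skipped.
    """
    params = {}
    for line in text.splitlines():
        if "##" not in line:
            continue
        value, _, comment = line.partition("##")
        # comment looks like " [5] [assembly_method]: Assembly method ..."
        name = comment.split("]: ", 1)[0].rsplit("[", 1)[-1]
        params[name] = value.strip()
    return params


# A few lines copied from the params file in this message.
sample = """\
------- ipyrad params file (v.0.7.21)-------------------------------------------
des                            ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
./MAC_RAW_GBS00289_L7_R*       ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
                               ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo-reference               ## [5] [assembly_method]: Assembly method (denovo, reference, denovo+reference, denovo-reference)
"""

params = parse_params(sample)
for name, value in params.items():
    print(f"{name} = {value!r}")
```

Empty values, such as sorted_fastq_path here, come back as empty strings, so unset parameters are easy to spot.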
Nathan Layman
@northbynate
Mar 28 2018 16:42
@isaacovercast No problems to report, just wanted to say you rock, man! It's nice to see the level of help you're willing to give people when they run into problems.
Isaac Overcast
@isaacovercast
Mar 28 2018 21:26
@northbynate :+1: thanks!