These are chat archives for dereneaton/ipyrad

Aug 2017
Zac Forsman
Aug 23 2017 00:30
Hey folks,
Zac Forsman
Aug 23 2017 00:36
I'm still trying to figure out optimal ipyrad settings for ezRAD data. One of the problems I've been having is that de novo assembly in ipyrad seems to take a very, very long time — up to a month. I've created a very small sample dataset based on mitochondrial reads to try to find settings that boost speed. I'm trying to do a reference assembly on this using default settings and it hangs at step 6 (clustering across). I've tried several things, such as reformatting the reference sequence or changing settings, but it doesn't seem to matter; it just hangs at this step. These data files are only a few kilobytes and the references are near perfect, so any advice would be appreciated! -Zac
Zac Forsman
Aug 23 2017 00:49
@Cycadales_twitter, I've also been having problems with speed and ezRAD data... I have some massive libraries and I think if they have a touch of degraded DNA the number of loci really shoots up. The nice thing is that you can use these libraries to pull out a lot of large contigs of interest like complete mitochondrial genomes... the bad thing is that it is sloooooow. The dDocent pipeline is super fast, but it lacks a lot of the phylogenetic functionality of ipyrad. One potential approach I have been thinking of is building a reference in dDocent, then assembling to that using ipyrad... but the program seems to hang at step 6 (clustering across) and I'm not sure why. What kinds of settings have given you luck for ezRAD data? Thanks! -Zac
James Clugston
Aug 23 2017 13:29

@zacforsman Hi Zac, my libraries are also very large, a minimum of 250 MB x 2 per sample (PE). Getting the settings right has been quite a challenge, as it does seem to take a very long time to run. I also initially used an Illumina NextSeq 500, which gave significantly poorer quality data than the Illumina HiSeq 4000, so the data needed to be cleaned using quite strict settings in TRIMMOMATIC. The problems I have had with my datasets come down to file sizes, the sheer number of reads, and problems with some of the clusters here in Edinburgh (a number of nodes keep dropping due to power cuts!). However, I find that using a 64-core server it takes around 3 weeks for a very large assembly to finish. These are the settings I am using: ```------- ipyrad params file (v.0.7.1)--------------------------------------------
Cycas_final ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
Cycas ## [1] [project_dir]: Project dir (made in curdir if not present)

                           ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
                           ## [3] [barcodes_path]: Location of barcodes file

/mirror/1tb/james/Cycas_fastq/*fastq.gz ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo ## [5] [assembly_method]: Assembly method (denovo, reference, denovo+reference, denovo-reference)

                           ## [6] [reference_sequence]: Location of reference sequence file

pairgbs ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
TAA, AATTC ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
15 ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
30 ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6 ## [11] [mindepth_statistical]: Min depth for statistical base calling
4 ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000 ## [13] [maxdepth]: Max cluster depth within samples
0.85 ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0 ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
2 ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
40 ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2 ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
10, 10 ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus (R1, R2)
8, 8 ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus (R1, R2)
4 ## [21] [min_samples_locus]: Min # samples per locus for output
30, 30 ## [22] [max_SNPs_locus]: Max # SNPs per locus (R1, R2)
12, 12 ## [23] [max_Indels_locus]: Max # of indels per locus (R1, R2)
0.5 ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus (R1, R2)
0, 0, 0, 0 ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0 ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
p, s, u, n, k, g, v ## [27] [output_formats]: Output formats (see docs)

                           ## [28] [pop_assign_file]: Path to population assignment file```  
James Clugston
Aug 23 2017 13:41
Deren Eaton
Aug 23 2017 15:38
Hey @zacforsman, we definitely need to do more optimizing for working with long reads (>300bp), most of our testing is on SE or PE 100bp reads. However, the problem may be with the data format. It looks like your paired reads in the ipyrad_formatted_data.tar.gz archive are formatted strangely. The R1 and R2 reads are in the same file whereas they should be in separate files.
Deren Eaton
Aug 23 2017 17:27
@zacforsman I ran some tests on your data (see notebook below). I can get it to run through denovo pretty well, but I'm not finding many hits to the reference, like you said. I'll look into it a bit more and see if I can figure out what's going on.
Aug 23 2017 19:40
@dereneaton I am using the de novo assembly, and no, since I ran step 3 I haven't updated my version, to avoid any conflicting issues between versions.
Aug 23 2017 19:58

On the topic of problems encountered during step 6, I'm having issues with an assembly of 384 samples on a Knights Landing compute node, where vsearch aborts during step 6 due to insufficient memory (Fatal error: Unable to allocate enough memory). My first thought was that this is because the KNL node has 68 cores and, as documented in the source, the vsearch invocation uses all available threads. I'd have assumed that vsearch can only access threads within the ipcluster, but the problem persists after cranking down the number of cores to 1. It turns out that regardless of the number of cores in the ipcluster, vsearch erroneously finds 272 cores:

vsearch v2.0.3_linux_x86_64, 94.1GB RAM, 272 cores
/home1/03125/nspope/.local/miniconda2/lib/python2.7/site-packages/bin/vsearch-linux-x86_64 -cluster_smallmem /scratch/03125/nspope/melis_assembly/melis_across/melis_catshuf.tmp -strand plus -query_cov 0.75 -minsl 0.5 -id 0.85 -userout /scratch/03125/nspope/melis_assembly/melis_across/melis.utemp -notmatched /scratch/03125/nspope/melis_assembly/melis_across/melis.htemp -userfields query+target+qstrand -maxaccepts 1 -maxrejects 0 -fasta_width 0 -threads 0 -fulldp -usersort -log /scratch/03125/nspope/melis_assembly/melis_across/s6_cluster_stats.txt 
Started  Wed Aug 23 14:46:24 2017
750073909 nt in 8122876 seqs, min 36, max 131, avg 92

The assembly/vsearch both run perfectly fine on a 16-core Sandy Bridge compute node (using all available cores). I know the architecture of KNL is different, and I'm trying to track down where exactly vsearch is going wrong. However, is there a reason why ipyrad can't invoke vsearch with the number of threads corresponding to the cores requested for the assembly at large? Thanks.
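The fix nspope is asking about amounts to replacing `-threads 0` (autodetect all hardware threads) with an explicit cap. A sketch of what building that step-6 vsearch command with a thread cap could look like (the function, paths, and the subset of flags shown are illustrative; this is not ipyrad's actual internal code):

```python
def build_cluster_across_cmd(catshuf, prefix, nthreads, clust_threshold=0.85):
    """Build a step-6-style vsearch command with an explicit thread cap.

    Passing an explicit -threads value instead of 0 keeps vsearch from
    autodetecting every hardware thread (e.g. 272 on a 68-core KNL node),
    which can exhaust RAM on large datasets.
    """
    return [
        "vsearch",
        "-cluster_smallmem", catshuf,
        "-strand", "plus",
        "-id", str(clust_threshold),
        "-userout", prefix + ".utemp",
        "-notmatched", prefix + ".htemp",
        "-userfields", "query+target+qstrand",
        "-threads", str(nthreads),  # cap to the cores requested for the run
        "-usersort",
    ]
```

The returned list would then be executed with something like `subprocess.run(cmd, check=True)`; the key point is simply that the thread count is tied to the cores the user asked for rather than left at 0.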

Deren Eaton
Aug 23 2017 20:24

Hi @nspope, interesting. Usually a single node doesn't have more than ~20 cores on it, so we haven't really encountered this problem. This should be easy to fix though using -t. We actually wrote a comment just above the line you highlighted to remind us to look into it because this could lead to memory problems:

## get call string. Thread=0 means all (default) but we may want to set
## an upper limit otherwise threads=60 could clobber RAM on large dsets.

I will write into the next release a workaround so that you can throttle the number of threads used by vsearch in this step using the -t flag, similar to the way it currently works in step 3. e.g.,

## soon to be available in v.0.7.12
## make 100 cores available in step 6 but run threaded job
## (e.g., vsearch) with max 20 threads. 
ipyrad -p params-test.txt -s 6 -c 100 -t 20
Aug 23 2017 20:56
@dereneaton Thanks, that'd be great. Am I correct that vsearch is allowed access to all threads, not just those in the cluster? It turns out there's 4 threads/core here, so 272 is indeed the total number of threads on this 68-core node. I have a suspicion that something about Knights Landing is influencing other steps in the assembly as well, as some parts run much slower than the regular compute nodes despite restricting the number of cores (in particular step 5 takes ~10 hours on KNL with this dataset). But I haven't had the time yet to look into it. I expect that KNL will become common over the next few years (they just got phased in on this XSEDE supercomputer)
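The 272-vs-68 discrepancy nspope describes is just logical threads versus physical cores: with 4-way hyperthreading, thread autodetection sees 4x the core count. A quick way to check what a tool's autodetection will see on a given node (using only the standard library; physical core counts would need something like psutil):

```python
import os

# os.cpu_count() reports *logical* hardware threads, which is what
# thread autodetection (like vsearch's "-threads 0") also sees.
# On a 68-core KNL node with 4 threads/core this returns 272, not 68.
logical_threads = os.cpu_count()
print(f"logical threads visible on this node: {logical_threads}")
```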
Deren Eaton
Aug 23 2017 21:05
@nspope, good to know. We'll keep that in mind. Most of our testing has been on HPC systems with 8-40 core nodes, where you need to use MPI to connect multiple nodes, in which case threading is still limited to the physically connected cores on a single node. But if a single node can expose 272 threads, then yeah, we'll need to limit it, otherwise you'll probably never have enough RAM, since vsearch uses a ton of RAM when running many threads.