These are chat archives for dereneaton/ipyrad

13th Jul 2017
Jenny Archibald
@jenarch
Jul 13 2017 16:03
Hello, I am running ipyrad on 283 accessions, MSG PE data. It's been stuck at 62% on Step 6 for over 6 days (output below). Any suggestions? I already implemented stronger filtering of these data based on your previous suggestions.
-------------------------------------------------------------
  ipyrad [v.0.7.1]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------
  New Assembly: m04c90
  host compute node: [24 cores] on n480
  host compute node: [4 cores] on n477
  host compute node: [20 cores] on n476

  Step 1: Loading sorted fastq data to Samples
  [####################] 100%  loading reads         | 0:04:57
  564 fastq files loaded to 282 Samples.

  Step 2: Filtering reads
  [####################] 100%  processing reads      | 0:17:04

  Step 3: Clustering/Mapping reads
  [####################] 100%  dereplicating         | 0:36:57
  [####################] 100%  clustering            | 1 day, 0:00:11
  [####################] 100%  building clusters     | 0:02:21
  [####################] 100%  chunking              | 0:00:17
  [####################] 100%  aligning              | 2:14:40
  [####################] 100%  concatenating         | 0:02:45

  Step 4: Joint estimation of error rate and heterozygosity
  [####################] 100%  inferring [H, E]      | 0:08:24

  Step 5: Consensus base calling
  Mean error  [0.00516 sd=0.00210]
  Mean hetero [0.01381 sd=0.00458]
  [####################] 100%  calculating depths    | 0:01:08
  [####################] 100%  chunking clusters     | 0:02:29
  [####################] 100%  consens calling       | 1:07:15

  Step 6: Clustering at 0.9 similarity across 282 samples
  [####################] 100%  concat/shuffle input  | 0:02:27
  [############        ]  62%  clustering across     | 6 days, 14:02:52
Deren Eaton
@dereneaton
Jul 13 2017 19:55
Hi @jenarch, when the clustering step goes really slowly like that it's usually because there is not enough RAM. Can you share your job submission script?
Jenny Archibald
@jenarch
Jul 13 2017 20:54
Yes, thanks for looking into this - I requested procs=48,pmem=5gb.
Deren Eaton
@dereneaton
Jul 13 2017 22:16
@jenarch Has it stopped moving from 62% for many days? Or has it slowly progressed up to 62%? Clustering is done using the software vsearch, which runs multi-threaded on a single node. There isn't much we can do to make it run any faster, but it's usually quite fast even for very large data sets. What affects its run times most is (1) the number of threads available; (2) the total number of consens reads that pass filtering for each sample; and (3) whether you have enough RAM to store these reads in memory during clustering.

We assign the clustering job randomly to one of the nodes that is available (since it can only run on a single node), assuming that they will all be equal in size. Thus, there is a chance that your clustering job was started on the node that has only 4 cores available and 20GB RAM (5gb per CPU). I don't know for sure that this is the case, however, and there isn't really any easy way to check unless you can run top on the job interactively. Still, though, 282 samples with PE data is a lot of data, so it's possible that the job is running on one of the big nodes and just taking a very long time anyway.

The next version of ipyrad (0.7.2) will allow for checkpointing within step 6, which will help make sure you can get through the step without hitting a walltime limit.
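One way to avoid the small-node lottery described above is to request a single large node explicitly, so the vsearch clustering job cannot land on the 4-core box. A minimal PBS/Torque sketch, assuming the assembly name from the log above (`m04c90`); the job name, walltime, and resource lines are illustrative assumptions to adapt to your cluster:

```shell
#!/bin/bash
#PBS -N ipyrad_step6
## Request ONE node with many cores; pmem is per-core, so 24 x 5gb
## gives ~120GB total RAM on the node running the clustering job.
#PBS -l nodes=1:ppn=24,pmem=5gb
#PBS -l walltime=168:00:00

cd $PBS_O_WORKDIR

## Re-run only step 6 on this assembly; -c caps ipyrad at the
## cores of this single node so all work stays where the RAM is.
ipyrad -p params-m04c90.txt -s 6 -c 24
```

Because the job is confined to one node, all 24 threads and the full memory request are available to vsearch during the "clustering across" stage.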