These are chat archives for dereneaton/ipyrad

7th
Jun 2017
dinmatias
@dinmatias
Jun 07 2017 00:08
Hi all, I'm running ipyrad v.0.6.27 and have a question about CPU usage. I ran ipyrad with 40 cores (the -c 40 argument). While monitoring how many cores it uses, I noticed that during step 3 (Clustering/Mapping reads) it only uses 20 cores, most of them for vsearch, I think. How can I tweak it so that it uses all of the specified cores? Also, what specifically does the threads (-t) argument do?
Deren Eaton
@dereneaton
Jun 07 2017 15:47
Hi @dinmatias, most steps of ipyrad parallelize by multiprocessing, meaning that jobs are split into smaller bits and distributed among all of the available cores. However, some parts of the analysis also use multithreading, where a single function runs across multiple cores. To complicate things further, parts like step 3 run several multithreaded jobs in parallel using multiprocessing... you still with me? The -c argument is the total number of cores that are available, while the -t argument gives finer control over how the multithreaded functions are distributed among those cores. For example, the default with 40 cores and -t=2 would be to start 20 two-threaded vsearch jobs. Some parts of the code cannot proceed until other parts finish, so at some points the code may run on fewer than the total number of cores available, which is likely what you are seeing in step 3. Basically, it will not start the aligning step until all of the samples have finished clustering. It's all fairly complicated, but we generally try to keep everything working as efficiently as possible. If you have just one or two samples that are much bigger (have more data) than the rest, and they are taking much longer to cluster, then you may see a speed improvement by increasing the threading argument (e.g., -t 4).
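As a rough illustration of the arithmetic in the reply above, the sketch below assumes that the number of simultaneous multithreaded jobs is simply the core count divided by the thread count, rounded down. The function name `concurrent_jobs` is hypothetical and is not part of ipyrad; the actual scheduling inside ipyrad may differ.

```python
def concurrent_jobs(cores: int, threads: int) -> int:
    """Estimate how many multithreaded jobs can run at the same time.

    Assumption (not ipyrad's actual code): jobs = cores // threads,
    with a floor of one job when threads exceeds cores.
    """
    if cores < 1 or threads < 1:
        raise ValueError("cores and threads must be positive integers")
    return max(1, cores // threads)


if __name__ == "__main__":
    # Compare a few -t values for a 40-core run (-c 40).
    for t in (2, 4, 8):
        print(f"-c 40 -t {t}: about {concurrent_jobs(40, t)} jobs at once")
```

Under this assumption, -c 40 with -t 2 gives the 20 jobs mentioned in the reply, and raising -t trades the number of samples clustered at once for more threads per sample.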
Deren Eaton
@dereneaton
Jun 07 2017 15:56
Hi @jaecan808_twitter, can you provide some more details about how you're running ipyrad? Are you running step 1 when you see this error?
dinmatias
@dinmatias
Jun 07 2017 16:40
Hi @dereneaton, many thanks for the response. I'm a bit confused by the threading part. So the -t argument specifies the number of threads that a particular task, in this case vsearch, can run in parallel? And rather than running multiple threads within one core, it runs the threads on different cores? Is that right? I'm trying to assemble (steps 3 to 7) 100 samples (paired RAD reads), more than 200GB in total. I've noticed that during the clustering, 1-2% of progress took 9-10 hours, so I'm trying to find ways to make better use of the computing resources at hand (btw, I did aggressive trimming of the reads as suggested in the FAQs). I'll try playing with -t again, see how it works, and let you know. Thanks!!!
Deren Eaton
@dereneaton
Jun 07 2017 17:08
@dinmatias The progress bar for the clustering part in step 3 does not move smoothly; it advances as samples finish vsearch clustering. So it may stick at 0% for a while and then jump up to 10% quickly if the first 10 or so samples all finish around the same time, so that's something to keep in mind. The most likely thing that would make clustering run very slowly is running out of RAM. Make sure you allocate ~4GB per core. Increasing the threading to a higher value will probably reduce the memory usage as well.
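The ~4GB per core rule of thumb above can be turned into a quick back-of-the-envelope check. This is only a sketch: the helper names `ram_needed_gb` and `cores_for_ram` are hypothetical, and the 4GB figure is the approximate guideline from the message, not a measured requirement.

```python
def ram_needed_gb(cores: int, gb_per_core: float = 4.0) -> float:
    """RAM to allocate for a given core count, using ~4GB per core."""
    if cores < 1 or gb_per_core <= 0:
        raise ValueError("cores must be >= 1 and gb_per_core must be > 0")
    return cores * gb_per_core


def cores_for_ram(ram_gb: float, gb_per_core: float = 4.0) -> int:
    """Largest core count that fits in the available RAM by the same rule."""
    if ram_gb < 0 or gb_per_core <= 0:
        raise ValueError("ram_gb must be >= 0 and gb_per_core must be > 0")
    return int(ram_gb // gb_per_core)


if __name__ == "__main__":
    print(f"40 cores -> about {ram_needed_gb(40):.0f} GB of RAM")
    print(f"128 GB of RAM -> about {cores_for_ram(128)} cores")
```

By this estimate, a 40-core run would want roughly 160GB of RAM, so on a machine with less memory it may be worth lowering -c or raising -t.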