These are chat archives for dereneaton/ipyrad

May 2017
May 23 2017 12:36

Hi @isaacovercast and @dereneaton, I have something similar to @LilyRobertLewis_twitter

```text
bash-4.2$ ipyrad -p params-all_ind_2017.txt -s34567 -c20

ipyrad [v.0.6.20]

Interactive assembly and analysis of RAD-seq data

loading Assembly: all_ind_2017
from saved path: /data/filer-5-2/brassac/Secalinum_NGS/GBS_2/raw_reads/new/all_ind_2017.json
host compute node: [20 cores] on

Step 3: Clustering/Mapping reads
[####################] 100% dereplicating | 0:04:00
[######## ] 41% clustering | 10 days, 2:42:43
```
This is on 100 individuals. When I check the processes on the cluster, I can see that, besides the few other people using it (CPU load ~80%, memory usage 96G/252G), I have 4 individuals running simultaneously, each one with three processes showing a runtime of 74h (and 236h), 122h (and 260h), 174h (and 285h) and 180h (and 288h).
I also noticed that in the clust folder I have 3 files of 0 KB (last changed days ago). Two belong to one individual (htemp and utemp) and the third belongs to another individual (utemp.sort). These are not the same individuals as the ones mentioned above.

This doesn't look normal, does it? Should I just kill it? Any idea where the issue(s) might be?
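
The check for empty files in the clust folder can be scripted. A minimal sketch, assuming a POSIX `find`; `list_empty_files` is a hypothetical helper name, and the directory in the usage comment is only an example:

```shell
# List zero-byte files under a clustering directory.
# $1 = path to the clust directory (example name below is illustrative).
list_empty_files() {
    find "$1" -type f -size 0 -print
}

# Usage (not run here):
#   list_empty_files all_ind_2017_clust_0.85
```

An empty temp file with an old timestamp is not necessarily a sign of failure, so this only narrows down which samples to look at.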


Isaac Overcast
May 23 2017 15:03
@joqb What does your data look like? Is it paired end? How long are the reads?
Can you post the results of `ls -ltr <your_clust_directory> | tail`?
May 23 2017 15:19
Hi @isaacovercast,
the data should be 100 bp single-end reads.
```text
-bash-4.2$ ls -ltr all_ind_2017_clust_0.85 | tail
-rw-r--r-- 1 brassac AGR-ETX  66838909 May 23 08:28 capense_FB_2015_02_2_1109331_TGCATGA_L003001.htemp
-rw-r--r-- 1 brassac AGR-ETX  90487771 May 23 08:28 capense_FB_2015_02_2_1109331_TGCATGA_L003001.utemp
-rw-r--r-- 1 brassac AGR-ETX 117309440 May 23 17:03 capense_JoB_2013_15C_1109339_CCTCTAG.htemp
-rw-r--r-- 1 brassac AGR-ETX  95289344 May 23 17:07 capense_JoB_2013_07C_1109342_AACGACC.htemp
-rw-r--r-- 1 brassac AGR-ETX 108199936 May 23 17:08 secalinum_JoB_2012_01B_1109345_TGGCAAT.utemp
-rw-r--r-- 1 brassac AGR-ETX 107479040 May 23 17:09 capense_JoB_2013_15C_1109339_CCTCTAG.utemp
-rw-r--r-- 1 brassac AGR-ETX  75956224 May 23 17:10 capense_JoB_2013_07C_1109342_AACGACC.utemp
-rw-r--r-- 1 brassac AGR-ETX 119078912 May 23 17:11 secalinum_JoB_2012_01B_1109345_TGGCAAT.htemp
-rw-r--r-- 1 brassac AGR-ETX 133038080 May 23 17:13 secalinum_JoB2012_005A_859441_ACTATCA_L005001.htemp
-rw-r--r-- 1 brassac AGR-ETX  91422720 May 23 17:14 secalinum_JoB2012_005A_859441_ACTATCA_L005001.utemp
```
Isaac Overcast
May 23 2017 17:08
@joqb It looks okay to me; it seems to still be running. Lots of singletons make clustering take longer, and that looks like the case here (htemp files hold "seeds", which include singletons). This can happen if the genome is huge or if the cut site is too common. Just let it run, it'll be fine.
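
One way to confirm that clustering is still making progress, as the recent timestamps in the listing suggest, is to list the files written within a recent time window. A minimal sketch, assuming a `find` that supports `-mmin` (GNU and BSD versions do; it is not part of POSIX); `recently_modified` is a hypothetical helper name:

```shell
# List files under a directory that were modified within the last N minutes.
# $1 = directory to scan, $2 = window in minutes.
recently_modified() {
    find "$1" -type f -mmin "-$2" -print
}

# Usage (not run here): files touched in the last hour
#   recently_modified all_ind_2017_clust_0.85 60
```

If the htemp and utemp files of the samples currently being clustered keep appearing in this list, the job is still writing output.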