These are chat archives for dereneaton/ipyrad

29th
Aug 2017
joqb
@joqb
Aug 29 2017 13:15

Hi @dereneaton, @isaacovercast, running ipyrad 0.7.11 I'm stuck at step 6. The clustering-across and building-clusters substeps run and finish without issues, but then I get an IOError. The drive I'm working on has 2.5 TB free.

bash-4.2$ ipyrad -p params-all_2017.txt -s 67 -c 20 -t 4 -d

  ** Enabling debug mode **

 -------------------------------------------------------------
  ipyrad [v.0.7.11]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------
  loading Assembly: all_2017
  from saved path: /filer/transfer/brassac/Secalinum_NGS/GBS_2/raw_reads/new/all_2017.json
  establishing parallel connection:
  host compute node: [20 cores] on qg-10.ipk-gatersleben.de

  Step 6: Clustering at 0.85 similarity across 99 samples
  Continuing from checkpoint 6.2
  [####################] 100%  clustering across     | 1 day, 4:31:22
  [####################] 100%  building clusters     | 0:01:21

  Encountered an unexpected error (see ./ipyrad_log.txt)
  Error message is below -------------------------------
IOError([Errno 2] No such file or directory: '/filer/transfer/brassac/Secalinum_NGS/GBS_2/raw_reads/new/all_2017_across/all_2017-tmpalign/all_2017.chunk_5602')

Any idea what's wrong?

Isaac Overcast
@isaacovercast
Aug 29 2017 15:02
@joqb Well, that definitely looks a lot like a disk space issue. That's exactly what happens when a job runs out of disk. Are you looking at the output of df -h to see how much free space is left?
Also, you might try watching the disk space as the process runs. It would be almost inconceivable for ipyrad to generate 2.5 TB worth of temporary files, I can't imagine how that would happen, but this really does look like a disk issue.
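One quick way to keep an eye on it while step 6 runs is a tiny polling script, something like this (just a plain-Python sketch, nothing ipyrad-specific; the mount path and interval are placeholders, Ctrl-C to stop):

import os
import time

MOUNT = "/filer/transfer"   # filesystem ipyrad is writing to (placeholder)
INTERVAL = 60               # seconds between checks (placeholder)

while True:
    st = os.statvfs(MOUNT)
    free_gb = st.f_bavail * st.f_frsize / float(1024 ** 3)
    print("{}  free on {}: {:.1f} GB".format(time.strftime("%H:%M:%S"), MOUNT, free_gb))
    time.sleep(INTERVAL)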
joqb
@joqb
Aug 29 2017 15:11

@isaacovercast Yeah, I know, that's why I mentioned it...

bash-4.2$ df -h
Filesystem                                      Size  Used Avail Use% Mounted on
filer.ipk-gatersleben.de:/transfer              4.T  1.T  2.T  38% /filer/transfer

Does this specific step generate large files?

Does it write files only where it said it was looking for them, or could they be cached in another temp folder?
If it helps, here is the debug log:
ipyrad [v.0.7.11]

Interactive assembly and analysis of RAD-seq data

Begin run: 2017-08-28 09:27
Using args {'preview': False, 'force': False, 'threads': 4, 'results': False, 'quiet': False, 'merge': None, 'ipcluster': None, 'cores': 20, 'params': 'params-all_2017.txt', 'branch': None, 'steps': '67', 'debug': True, 'new': None, 'MPI': False}
Platform info: ('Linux', 'qg-10.ipk-gatersleben.de', '3.10.0-514.26.2.el7.x86_64', '#1 SMP Tue Jul 4 15:04:05 UTC 2017', 'x86_64')
2017-08-28 09:27:58,655 pid=98026 [load.py] DEBUG skipping: no svd results present in old assembly
2017-08-28 09:27:59,209 pid=98026 [parallel.py] INFO ['ipcluster', 'start', '--daemonize', '--cluster-id=ipyrad-cli-98026', '--engines=Local', '--profile=default', '--n=20']
2017-08-28 09:28:09,438 pid=98026 [cluster_across.py] INFO checkpoint = 2
2017-08-28 09:28:09,439 pid=98026 [cluster_across.py] INFO substeps = [2, 3, 4, 5, 6, 7]
2017-08-28 09:28:10,986 pid=98787 [cluster_across.py] INFO ['/home/brassac/miniconda2/lib/python2.7/site-packages/bin/vsearch-linux-x86_64', '-cluster_smallmem', '/filer/transfer/brassac/Secalinum_NGS/GBS_2/raw_reads/new/all_2017_across/all_2017_catshuf.tmp', '-strand', 'plus', '-query_cov', '0.75', '-minsl', '0.5', '-id', '0.85', '-userout', '/filer/transfer/brassac/Secalinum_NGS/GBS_2/raw_reads/new/all_2017_across/all_2017.utemp', '-notmatched', '/filer/transfer/brassac/Secalinum_NGS/GBS_2/raw_reads/new/all_2017_across/all_2017.htemp', '-userfields', 'query+target+qstrand', '-maxaccepts', '1', '-maxrejects', '0', '-fasta_width', '0', '-threads', '0', '-fulldp', '-usersort', '-log', '/filer/transfer/brassac/Secalinum_NGS/GBS_2/raw_reads/new/all_2017_across/s6_cluster_stats.txt']
2017-08-29 13:59:27,244 pid=98787 [cluster_across.py] INFO ended vsearch tracking loop
2017-08-29 13:59:52,585 pid=98794 [cluster_across.py] INFO loading full _catcons file into memory
2017-08-29 14:00:53,053 pid=98794 [cluster_across.py] INFO building clustbits, optim=5602, nseeds=447686, cpus=20
2017-08-29 14:00:54,184 pid=98026 [assembly.py] ERROR IOError([Errno 2] No such file or directory: '/filer/transfer/brassac/Secalinum_NGS/GBS_2/raw_reads/new/all_2017_across/all_2017-tmpalign/all_2017.chunk_5602')
2017-08-29 14:00:55,315 pid=98026 [assembly.py] INFO pids [(0, 98739), (1, 98737), (2, 98738), (3, 98740), (4, 98744), (5, 98747), (6, 98752), (7, 98751), (8, 98757), (9, 98761), (10, 98771), (11, 98775), (12, 98768), (13, 98787), (14, 98789), (15, 98794), (16, 98790), (17, 98833), (18, 98815), (19, 98841)]
2017-08-29 14:00:55,324 pid=98026 [assembly.py] INFO queue {0: {u'queue': 0, u'completed': 2, u'tasks': 0}, 1: {u'queue': 0, u'completed': 2, u'tasks': 0}, 2: {u'queue': 0, u'completed': 2, u'tasks': 0}, 3: {u'queue': 0, u'completed': 2, u'tasks': 0}, 4: {u'queue': 0, u'completed': 2, u'tasks': 0}, 5: {u'queue': 0, u'completed': 2, u'tasks': 0}, 6: {u'queue': 0, u'completed': 2, u'tasks': 0}, 7: {u'queue': 0, u'completed': 2, u'tasks': 0}, 8: {u'queue': 0, u'completed': 3, u'tasks': 0}, 9: {u'queue': 0, u'completed': 2, u'tasks': 0}, 10: {u'queue': 0, u'completed': 2, u'tasks': 0}, 11: {u'queue': 0, u'completed': 2, u'tasks': 0}, 12: {u'queue': 0, u'completed': 2, u'tasks': 0}, 13: {u'queue': 0, u'completed': 3, u'tasks': 0}, 14: {u'queue': 0, u'completed': 2, u'tasks': 0}, 15: {u'queue': 0, u'completed': 3, u'tasks': 0}, 16: {u'queue': 0, u'completed': 2, u'tasks': 0}, 17: {u'queue': 0, u'completed': 2, u'tasks': 0}, 18: {u'queue': 0, u'completed': 2, u'tasks': 0}, u'unassigned': 0, 19: {u'queue': 0, u'completed': 2, u'tasks': 0}}
2017-08-29 14:00:56,415 pid=98026 [assembly.py] INFO queue {0: {u'queue': 0, u'completed': 2, u'tasks': 0}, 1: {u'queue': 0, u'completed': 2, u'tasks': 0}, 2: {u'queue': 0, u'completed': 2, u'tasks': 0}, 3: {u'queue': 0, u'completed': 2, u'tasks': 0}, 4: {u'queue': 0, u'completed':

Isaac Overcast
@isaacovercast
Aug 29 2017 15:27
I don't think it's disk. Here it's looping and creating a bunch of temp files, but it's crashing trying to open the first file:
                            with open(os.path.join(data.tmpdir,
                                data.name+".chunk_{}".format(loci)), 'w') as clustsout:
Could it be that you don't have write access to this directory?
Can you mkdir wat in this directory: /filer/transfer/brassac/Secalinum_NGS/GBS_2/raw_reads/new/all_2017_across
Can I see the output of ls -l /filer/transfer/brassac/Secalinum_NGS/GBS_2/raw_reads/new/all_2017_across
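For what it's worth, a guard along these lines around that write would sidestep a missing tmp dir (a hypothetical sketch only, not the actual ipyrad code; write_chunk and chunk_text are made up, while data.tmpdir and data.name are the fields from the snippet above):

import os

def write_chunk(data, loci, chunk_text):
    # re-create the tmp dir if it went missing before the chunk is written
    if not os.path.isdir(data.tmpdir):
        os.makedirs(data.tmpdir)
    path = os.path.join(data.tmpdir, data.name + ".chunk_{}".format(loci))
    with open(path, 'w') as clustsout:
        clustsout.write(chunk_text)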
joqb
@joqb
Aug 29 2017 15:31
bash-4.2$ cd /filer/transfer/brassac/Secalinum_NGS/GBS_2/raw_reads/new/all_2017_across
bash-4.2$ mkdir wat
bash-4.2$ ls -l /filer/transfer/brassac/Secalinum_NGS/GBS_2/raw_reads/new/all_2017_across
total 4215356
-rwx------ 1 brassac AGR-ETX  269772974 Aug 25 17:35 all_2017_catcons.tmp
-rwx------ 1 brassac AGR-ETX 1025856322 Aug 25 17:36 all_2017_cathaps.tmp
-rwx------ 1 brassac AGR-ETX 1025856322 Aug 25 17:39 all_2017_catshuf.tmp
-rwx------ 1 brassac AGR-ETX 1025856322 Aug 25 17:38 all_2017_catsort.tmp
-rwx------ 1 brassac AGR-ETX  109694144 Aug 29 13:59 all_2017.htemp
-rwx------ 1 brassac AGR-ETX  421250452 Aug 29 13:59 all_2017.utemp
-rwx------ 1 brassac AGR-ETX  421250452 Aug 26 22:11 all_2017.utemp.sort
-rwx------ 1 brassac AGR-ETX       1143 Aug 29 13:59 s6_cluster_stats.txt
drwx------ 2 brassac AGR-ETX       4096 Aug 29 17:30 wat
Isaac Overcast
@isaacovercast
Aug 29 2017 15:49
It looks like the tmp directory isn't getting created properly: /filer/transfer/brassac/Secalinum_NGS/GBS_2/raw_reads/new/all_2017_across/all_2017-tmpalign
joqb
@joqb
Aug 29 2017 15:56
But I was able to mkdir wat without issues, or did I miss something?
Isaac Overcast
@isaacovercast
Aug 29 2017 15:58
No, yeah that worked, which is why I'm a little stumped. Can I see ls -l /filer/transfer/brassac/Secalinum_NGS/GBS_2/raw_reads/new
This is on a cluster of some kind?
joqb
@joqb
Aug 29 2017 16:01
I don't exactly know how to describe the architecture...
bash-4.2$ ls -l  /filer/transfer/brassac/Secalinum_NGS/GBS_2/raw_reads/new/
total 11068
drwx------ 3 brassac AGR-ETX    4096 Aug 29 17:30 all_2017_across
drwx------ 2 brassac AGR-ETX  184320 Aug 25 17:34 all_2017_clust_0.85
drwx------ 2 brassac AGR-ETX  290816 Aug 25 17:34 all_2017_consens
-rwx------ 1 brassac AGR-ETX  473976 Aug 29 14:00 all_2017.json
-rwx------ 1 brassac AGR-ETX  348149 Aug 25 15:41 all_ind_2017_bis.json
-rwx------ 1 brassac AGR-ETX    3599 Aug 25 15:41 all_ind_2017_bis_s1_demultiplex_stats.txt
-rwx------ 1 brassac AGR-ETX       0 Jul 13 10:43 all_ind_2017.json
-rwx------ 1 brassac AGR-ETX    5899 Aug 25 15:41 all_ind_2017_s1_demultiplex_stats.txt
-rwx------ 1 brassac AGR-ETX    3930 Aug 25 15:41 all_inds_spc_params.txt
-rwx------ 1 brassac AGR-ETX   61058 Aug 25 15:41 asian_marinum.json
-rwx------ 1 brassac AGR-ETX     167 Aug 25 15:41 asian-pop.txt
-rwx------ 1 brassac AGR-ETX  348729 Aug 25 15:41 asians.json
-rwx------ 1 brassac AGR-ETX     295 Aug 25 15:41 asians_marinum.txt
-rwx------ 1 brassac AGR-ETX    3599 Aug 25 15:41 asians_s1_demultiplex_stats.txt
-rwx------ 1 brassac AGR-ETX     170 Aug 25 15:41 asians.txt
-rwx------ 1 brassac AGR-ETX  165573 Aug 25 15:41 capense.json
-rwx------ 1 brassac AGR-ETX     825 Aug 25 15:41 capense.txt
-rwx------ 1 brassac AGR-ETX  129082 Aug 25 15:41 gmon.out
-rwx------ 1 brassac AGR-ETX    3634 Aug 25 15:41 guss_params.txt
-rwx------ 1 brassac AGR-ETX 8877591 Aug 29 14:00 ipyrad_log.txt
-rwx------ 1 brassac AGR-ETX   50119 Aug 25 15:41 marinum.json
-rwx------ 1 brassac AGR-ETX    3016 Aug 25 16:01 params-all_2017.txt
-rwx------ 1 brassac AGR-ETX    3015 Aug 25 15:41 params-all_ind_2017.txt
-rwx------ 1 brassac AGR-ETX    3073 Aug 25 15:41 params-asian_marinum.txt
-rwx------ 1 brassac AGR-ETX    3090 Aug 25 15:41 params-asians.txt
-rwx------ 1 brassac AGR-ETX    3073 Aug 25 15:41 params-capense.txt
-rwx------ 1 brassac AGR-ETX    3076 Aug 25 15:41 params-marinum.txt
-rwx------ 1 brassac AGR-ETX    3073 Aug 25 15:41 params-secalinum.txt
-rwx------ 1 brassac AGR-ETX    3008 Aug 25 15:41 params-wtf.txt
-rwx------ 1 brassac AGR-ETX    3779 Aug 25 15:41 popfile_ipyrad.txt
-rwx------ 1 brassac AGR-ETX  218401 Aug 25 15:41 secalinum.json
-rwx------ 1 brassac AGR-ETX    1165 Aug 25 15:41 secalinum.txt
Isaac Overcast
@isaacovercast
Aug 29 2017 16:13
That all looks fine (assuming it's not running as some weird user on whatever kind of platform you're on, but if it were I'd expect it to have broken much earlier)... Well, this is very hackish, but you could try mkdir /filer/transfer/brassac/Secalinum_NGS/GBS_2/raw_reads/new/all_2017_across/all_2017-tmpalign and then re-running it. I kind of don't want that to actually fix the problem, but it'll give us good information if it doesn't crash. I'm still looking.
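If you'd rather do it from Python, a quick check like this (a sketch; the path is copied from your error message, adjust if your assembly lives elsewhere) creates the directory and confirms a chunk-style file can actually be written there before you relaunch step 6:

import os

tmpdir = ("/filer/transfer/brassac/Secalinum_NGS/GBS_2/raw_reads/new/"
          "all_2017_across/all_2017-tmpalign")

# create the missing tmp dir, then do a quick write test inside it
if not os.path.isdir(tmpdir):
    os.makedirs(tmpdir)
testfile = os.path.join(tmpdir, "write_test.chunk_0")
with open(testfile, "w") as out:
    out.write("ok\n")
os.remove(testfile)
print("tmpdir exists and is writable: " + tmpdir)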
mdrphd
@mdrphd
Aug 29 2017 20:15

Looking for some help with the demultiplexing step with technical replicates. I have several libraries in which each of 96 samples has four different barcodes in the library. When I run step 1 in ipyrad and look at the s1_demultiplex_stats.txt file, only one of the samples was recognized as a technical rep with read numbers for each barcode. None of the other 95 samples have a -technical-replicate-n designation, and only ~25% of the reads in the library are matched to barcodes. Also, I have read counts for only 99 barcodes out of the 384 that are in my barcodes file. It seems that my samples are not being combined over barcodes: ipyrad is matching one barcode per sample and missing the other three barcodes per sample. Interestingly, this is happening in all 11 libraries I have. I have tried changing the barcode file to space-delimited and tab-delimited, which didn't make a difference. I also sorted the barcode file based on sample name, which gave different read counts per sample because a different barcode was matched to the reads. I am not sure what I am doing wrong, but any suggestions would be appreciated. Are there specific requirements for sample names? I could do a workaround where I give each sample a unique name (e.g., sample1a, sample1b, sample1c, sample1d) and then combine the demultiplexed fastq files that belong to the same sample, but we often have libraries with multiple barcodes per sample, so I would like to use the technical replicates feature of ipyrad if possible. To illustrate, here is part of a barcode file (and after the stats output below, a sketch of the grouping I would expect):

N1-139-2 AACCTA
N1-139-2 CATCACAAG
N1-139-2 CTCTCCAG
N1-139-2 TGACGCCA
N1-139-4 CAGATA
N1-139-4 GATCAT
N1-139-4 TAATTG
N1-139-4 TCCAG
N1-139-7 AACTGAAG
N1-139-7 ATCTCGT
N1-139-7 CCAGGCAACA
N1-139-7 GAAGTG

And here is part of the s1_demultiplex_stats.txt file:

sample_name                               true_bar       obs_bar     N_records
N1-139-2                                  TGACGCCA      TGACGCCA        666503
N1-139-4                                     TCCAG         TCCAG        669865
N1-139-7                                    GAAGTG        GAAGTG        481372
N1-149-10                                 TGGACACT      TGGACACT        460965
N1-149-3                                   TGATAAT       TGATAAT        567875
N1-149-7                                 TATTCGCAT     TATTCGCAT        566708
N1-152-3                                 TTGCACCAG     TTGCACCAG        488209
N1-152-4                                   TTGCGCT       TTGCGCT        879377
N1-152-8                                    GGCTTA        GGCTTA        653796
N1-156-3                                    TGTGGA        TGTGGA        719520
N1-156-5                                    TTCACG        TTCACG        764681
N1-156-8                                  GACACACT      GACACACT        822448
N3A-123-3                                   GGTATA        GGTATA        884166
N3A-123-6                                    TTGAA         TTGAA        643748
N3A-123-7-technical-replicate-1           CTATCACT      CTATCACT        626773
N3A-123-7-technical-replicate-2              CTCGG         CTCGG        484438
N3A-123-7-technical-replicate-3              GGTGT         GGTGT        754405
N3A-123-7-technical-replicate-4              GTGTT         GTGTT        706834
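For reference, this is roughly the grouping I would expect from a barcode file with repeated sample names, written as a minimal plain-Python sketch (barcodes.txt is just a placeholder filename; this is not ipyrad's actual demultiplexing code):

from collections import OrderedDict

barcodes = OrderedDict()
with open("barcodes.txt") as infile:      # whitespace-delimited: sample_name  barcode
    for line in infile:
        if not line.strip():
            continue
        name, bar = line.split()
        barcodes.setdefault(name, []).append(bar)

for name, bars in barcodes.items():
    if len(bars) == 1:
        print(name + "\t" + bars[0])
    else:
        # every repeated sample name should get its own technical-replicate label
        for i, bar in enumerate(bars, 1):
            print("{}-technical-replicate-{}\t{}".format(name, i, bar))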