These are chat archives for dereneaton/ipyrad

29th
Aug 2016
Zac Forsman
@zacforsman
Aug 29 2016 00:04
Hey folks, I keep getting errors when trying to do reference assemblies against transcriptomic contigs that are concatenated together with a padding of a few hundred NNNNNs in between transcripts. It seems to chug along fine, then I get a lot of errors at the end of step 3, which takes a few days... this is what the errors look like:
Step 3: Clustering/Mapping reads
Reference sequence index exists
[####################] 100% dereplicating | 0:01:14
[####################] 100% mapping | 20:28:08
[####################] 100% finalize mapping | 0:00:00
All samples failed ref_muscle_chunker - All samples failed:
IPyradError( Error in merging pairs:
(Command '/home/z/anaconda/lib/python2.7/site-packages/bin/vsearch-linux-x86_64 --fastq_mergepairs /home/z/Desktop/Poritesbleached/trans_refmapping/113_S4PlobTranscriptPseudoref1399214073-R1 --reverse /home/z/Desktop/Poritesbleached/trans_refmapping/113_S4PlobTranscriptPseudoref1399214073-R2 --fastqout /home/z/Desktop/Poritesbleached/trans_refmapping/113_S4PlobTranscriptPseudoref1399214073-merged --fastqout_notmerged_fwd /home/z/Desktop/Poritesbleached/trans_edits/tmpco75la_nonmergedR1.fastq --fastqout_notmerged_rev /home/z/Desktop/Poritesbleached/trans_edits/tmpd0wyFz_nonmergedR2.fastq --fasta_width 0 --fastq_allowmergestagger --fastq_minmergelen 35 --fastq_maxns 5 --fastq_minovlen 20 --fastq_maxdiffs 4 --label_suffix _m1 --threads 0' returned non-zero exit status 1).)
IPyradError( Error in merging pairs:
(Command '/home/z/anaconda/lib/python2.7/site-packages/bin/vsearch-linux-x86_64 --fastq_mergepairs /home/z/Desktop/Poritesbleached/trans_refmapping/85_S1PlobTranscriptPseudoref125059125108-R1 --reverse /home/z/Desktop/Poritesbleached/trans_refmapping/85_S1PlobTranscriptPseudoref125059125108-R2 --fastqout /home/z/Desktop/Poritesbleached/trans_refmapping/85_S1PlobTranscriptPseudoref125059125108-merged --fastqout_notmerged_fwd /home/z/Desktop/Poritesbleached/trans_edits/tmp62fFSN_nonmergedR1.fastq --fastqout_notmerged_rev /home/z/Desktop/Poritesbleached/trans_edits/tmpqLvSc4_nonmergedR2.fastq --fasta_width 0 --fastq_allowmergestagger --fastq_minmergelen 35 --fastq_maxns 5 --fastq_minovlen 20 --fastq_maxdiffs 4 --label_suffix _m1 --threads 0' returned non-zero exit status 1).)
IPyradError( Error in merging pairs:
(Command '/home/z/anaconda/lib/python2.7/site-packages/bin/vsearch-linux-x86_64 --fastq_mergepairs /home/z/Desktop/Poritesbleached/trans_refmapping/95A_S5PlobTranscriptPseudoref135992136033-R1 --reverse /home/z/Desktop/Poritesbleached/trans_refmapping/95A_S5PlobTranscriptPseudoref135992136033-R2 --fastqout /home/z/Desktop/Poritesbleached/trans_refmapping/95A_S5PlobTranscriptPseudoref135992136033-merged --fastqout_notmerged_fwd /home/z/Desktop/Poritesbleached/trans_edits/tmp4AE8wi_nonmergedR1.fastq --fastqout_notmerged_rev /home/z/Desktop/Poritesbleached/trans_edits/tmpwdua4B_nonmergedR2.fastq --fasta_width 0 --fastq_allowmergestagger --fastq_minmergelen 35 --fastq_maxns 5 --fastq_minovlen 20 --fastq_maxdiffs 4 --label_suffix _m1 --threads 0' returned non-zero exit status 1).)
IPyradError( Error in merging pairs:
(Command '/home/z/anaconda/lib/python2.7/site-packages/bin/vsearch-linux-x86_64 --fastq_mergepairs /home/z/Desktop/Poritesbleached/trans_refmapping/67_S8PlobTranscriptPseudoref16781728-R1 --reverse /home/z/Desktop/Poritesbleached/trans_refmapping/67_S8PlobTranscriptPseudoref16781728-R2 --fastqout /home/z/Desktop/Poritesbleached/trans_refmapping/67_S8PlobTranscriptPseudoref16781728-merged --fastqout_notmerged_fwd /home/z/Desktop/Poritesbleached/trans_edits/tmpk3JVEt_nonmergedR1.fastq --fastqout_notmerged_rev /home/z/Desktop/Poritesbleached/trans_edits/tmpiXvmQL_nonmergedR2.fastq --fasta_width 0 --fastq_allowmergestagger --fastq_minmergelen 35 --fastq_maxns 5 --fastq_minovlen 20 --fastq_maxdiffs 4 --label_suffix _m1 --threads 0' returned non-zero exit status 1).)
IPyradError( Error in merging pairs:
(Command '/home/z/anaconda/lib/python2.7/site-packages/bin/vsearch-linux-x86_64 --fastq_mergepairs /home/z/Desktop/Poritesbleached/trans_refmapping/96_S7PlobTranscriptPseudoref3513735231-R1 --reverse /home/z/Desktop/Poritesbleached/trans_refmapping/96_S7PlobTranscriptPseudoref3513735231-R2 --fastqout /home/z/Desktop/Poritesbleached/trans_refmapping/96_S7PlobTranscriptPseudoref3513735231-merged --fastqou
Zac Forsman
@zacforsman
Aug 29 2016 00:11
Looking more closely at the log file, there were memory allocation errors... and I did run out of disk space the first time I ran this... I re-ran it without deleting the files or using the force flag, so maybe that is the source of the error... I'll retry after removing the files (start from scratch) and let you know. -Zac
Edgardo M. Ortiz
@edgardomortiz
Aug 29 2016 13:47
Hello, is it normal for the database building step to be the most time-consuming?
 -------------------------------------------------------------
  ipyrad [v.0.3.34]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------
  New Assembly: dman
  local compute node: [48 cores] on nid00724

  Step 1: Linking sorted fastq data to Samples
    Linking to demultiplexed fastq files in: /scratch/02728/emo347/ddRAD/dmandonii/f_denovo-rad_ipyrad/dman_fastqs/*.gz
    174 new Samples created in 'dman'.
    348 fastq files linked to 174 new Samples.

  Step 2: Filtering reads
  [####################] 100%  processing reads      | 0:46:54

  Step 3: Clustering/Mapping reads
  [####################] 100%  dereplicating         | 0:00:00
  [####################] 100%  clustering            | 0:35:26
  [####################] 100%  chunking              | 0:00:01
  [####################] 100%  aligning              | 6:26:46
  [####################] 100%  concatenating         | 0:02:40

  Step 4: Joint estimation of error rate and heterozygosity
  [####################] 100%  inferring [H, E]      | 0:24:43

  Step 5: Consensus base calling
  Mean error  [0.00116 sd=0.00046]
  Mean hetero [0.01420 sd=0.00200]
  [####################] 100%  consensus calling     | 0:23:07

  Step 6: Clustering across 174 samples at 0.85 similarity
  [####################] 100%  concat/shuffle input  | 0:00:37
  [####################] 100%  clustering across     | 0:04:42
  [####################] 100%  building clusters     | 0:01:25
  [####################] 100%  aligning clusters     | 0:06:30
  [####################] 100%  indexing clusters     | 0:17:04
  [#######             ]  37%  building database     | 10:50:14
Deren Eaton
@dereneaton
Aug 29 2016 13:49
currently yes. It is not parallelized very efficiently yet. Another thing on the TODO list.
Did you get that one sample to work with the assembly?
Edgardo M. Ortiz
@edgardomortiz
Aug 29 2016 13:51
Nope, this is a different dataset. Did it fail on your computer as well?
Deren Eaton
@dereneaton
Aug 29 2016 13:51
I'm not sure why it got excluded before; it seemed to pass through step 3 fine when I tested it.
I'll think about it some more.
Edgardo M. Ortiz
@edgardomortiz
Aug 29 2016 13:53
Thanks! Do you think it has to do with the -c flag? I am using 8. Should I reduce it?
Deren Eaton
@dereneaton
Aug 29 2016 13:56
It's possible that it was caused by a memory error
but I think that would have been printed into the log file.
The file was about 2 GB I think, and so vsearch needs at least that much RAM. And if it was doing 7 other individuals at the same time then it needs at least ~10 GB RAM or more. That's not really that much, so I would assume it wouldn't be a problem.
Edgardo M. Ortiz
@edgardomortiz
Aug 29 2016 14:02
Yes, and judging from the files, the one that is not getting anywhere is the .clust.gz. I will try again with fewer cores and see what happens...
Deren Eaton
@dereneaton
Aug 29 2016 14:09
You should be able to run step 3 again (without the force flag) and it will skip over all of the samples with state > 3 and only run the remaining sample that did not finish clustering.
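For example (the params file name here is just a placeholder), the call would look something like:
ipyrad -p params.txt -s 3
i.e., the same step 3 command as before, just without the -f force flag, so samples that already finished get skipped.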
Edgardo M. Ortiz
@edgardomortiz
Aug 29 2016 14:11
Good to know! I was going to start from scratch.
Edgardo M. Ortiz
@edgardomortiz
Aug 29 2016 15:54
The sample got processed when I just analyzed R1s. Now I am running step 3 on the pairs using only 2 cores (it doesn't let you use just 1, right?)... fingers crossed.
Isaac Overcast
@isaacovercast
Aug 29 2016 18:20
@edgardomortiz you can totally use just one core, I do it all the time when I'm troubleshooting parallelization issues.
Edgardo M. Ortiz
@edgardomortiz
Aug 29 2016 18:26
Ahh OK, so I just have to submit the job without the ipcluster setup, right?
Isaac Overcast
@isaacovercast
Aug 29 2016 18:27
Ah, no, I don't think that'll work; you run the ipcluster, but then give it the -c 1.
Deren Eaton
@dereneaton
Aug 29 2016 18:27
You'll have to do ipcluster start --n=1 --profile=ipyrad for it to use one engine to process jobs.
@isaacovercast he's on the Texas cluster that doesn't work with the autolaunching ipyparallel right now.
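In other words, roughly these two commands (params file name assumed here):
ipcluster start --n=1 --profile=ipyrad
ipyrad -p params.txt -s 3 -c 1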
Isaac Overcast
@isaacovercast
Aug 29 2016 18:28
ahhhh, yes i see
Edgardo M. Ortiz
@edgardomortiz
Aug 29 2016 18:28
I did that (ipcluster start --n=1 --profile=ipyrad) and it responded that I can't start a job with no engines. I will try again.
Deren Eaton
@dereneaton
Aug 29 2016 18:29
maybe you entered n=0?
Edgardo M. Ortiz
@edgardomortiz
Aug 29 2016 18:31
I haven't been using = after the --n, I hope that is the issue
Deren Eaton
@dereneaton
Aug 29 2016 18:32
I don't think it's necessarily required, but I guess it's better to be programmatic about it.
Edgardo M. Ortiz
@edgardomortiz
Aug 29 2016 18:36
Well, I will use --n=1 next time. Right now it is running with 2 cores. It seems that the sample finished clustering but the .clust.gz was not populated, because right now it is working on two different samples:
-rw------- 1 emo347 779M Aug 29 13:08 ast_90_clust_0.9/DIP-Dmeye.utemp
-rw------- 1 emo347  12M Aug 29 13:08 ast_90_clust_0.9/DIP-Dmeye.htemp
-rw------- 1 emo347    0 Aug 29 13:08 ast_90_clust_0.9/DIP-Dmeye.clust.gz
Deren Eaton
@dereneaton
Aug 29 2016 18:37
give it a minute...
@isaacovercast it seems like if build_clusters failed it should definitely raise an error, any ideas?
Edgardo M. Ortiz
@edgardomortiz
Aug 29 2016 18:43
 -------------------------------------------------------------
  ipyrad [v.0.3.34]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------
  loading Assembly: ast_90
  from saved path: /scratch/02728/emo347/ddRAD/astereae/e_denovo-rad_ipyrad/ast_90.json
  local compute node: [48 cores] on nid00252

  Step 3: Clustering/Mapping reads
  [####################] 100%  dereplicating         | 0:10:06
  [####################] 100%  clustering            | 2:56:34
  Samples failed this step:DIP-Dmeye
  Samples failed this step - ['DIP-Dmeye']
  [####################] 100%  chunking              | 0:05:17

I will leave it aligning the rest. Then I will rerun step 3; next time it should only process the one that failed and skip the rest...

The .utemp file for that sample is 7x larger than the next:

-rw------- 1 emo347 101M Aug 29 10:52 PIO-Derio.utemp
-rw------- 1 emo347 103M Aug 29 13:20 AS2-Halie.utemp
-rw------- 1 emo347 107M Aug 29 12:24 HYB-Dspc2.utemp
-rw------- 1 emo347 109M Aug 29 11:51 AS2-Pquad.utemp
-rw------- 1 emo347 110M Aug 29 13:38 PIO-Dcaya.utemp
-rw------- 1 emo347 110M Aug 29 11:07 DIP-Dgood.utemp
-rw------- 1 emo347 111M Aug 29 12:42 AS2-Btric.utemp
-rw------- 1 emo347 119M Aug 29 11:25 DIP-Deric.utemp
-rw------- 1 emo347 132M Aug 29 13:07 PIO-Dalve.utemp
-rw------- 1 emo347 779M Aug 29 13:08 DIP-Dmeye.utemp
Deren Eaton
@dereneaton
Aug 29 2016 18:49
that means it has 7X as many reads that clustered to another read. Probably just because it had so much more input data. I'm doing some testing on the sample now.
Deren Eaton
@dereneaton
Aug 29 2016 18:57
In [24]: build_clusters(data, sample)
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-24-934296d37bee> in <module>()
----> 1 build_clusters(data, sample)

<ipython-input-4-07e75b8b520d> in build_clusters(data, sample)
     84     ## and messes up writing the final clustS.gz
     85     if seqslist:
---> 86         clustfile.write("\n//\n//\n".join(seqslist)+"\n")
     87 
     88     ## make Dict. from seeds (_temp files)

/home/deren/miniconda2/lib/python2.7/gzip.pyc in write(self, data)
    239 
    240         if len(data) > 0:
--> 241             self.fileobj.write(self.compress.compress(data))
    242             self.size += len(data)
    243             self.crc = zlib.crc32(data, self.crc) & 0xffffffffL

OverflowError: size does not fit in an int
Looks like we're doing this step really inefficiently, and it's crashing from the clustfile list becoming too large.
Edgardo M. Ortiz
@edgardomortiz
Aug 29 2016 18:58
Do you think that a loop to load the .utemp instead of readlines() would be better?
Deren Eaton
@dereneaton
Aug 29 2016 18:59
Or I guess the error is when writing to the gzip file, but still, something is getting too big.
I guess we never optimized the func b/c we never ran into a sample that didn't finish in a few seconds. Yeah, we probably need to come up with something using iterators or loops here.
Or probably just stick in a print-to-file and clear-list statement somewhere inside the loop.
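As a minimal sketch of that kind of fix (illustration only, not the actual ipyrad code; the function and variable names here are made up):

import gzip

def write_clusters_chunked(clust_path, clusters, chunk_size=1000):
    """Flush joined clusters to the gzip file every chunk_size clusters,
    so the string handed to write() never grows too large."""
    seqslist = []
    with gzip.open(clust_path, "wt") as clustfile:
        for clust in clusters:
            seqslist.append(clust)
            if len(seqslist) == chunk_size:
                # keep the //-separators between chunks as well as within them
                clustfile.write("\n//\n//\n".join(seqslist) + "\n//\n//\n")
                seqslist = []
        # write whatever is left over
        if seqslist:
            clustfile.write("\n//\n//\n".join(seqslist) + "\n")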
Deren Eaton
@dereneaton
Aug 29 2016 19:12
But the dictionary before that is holding way too much in memory. We could probably replace this with some kind of indexed array.
Well, it works for now with that quick fix:
>DIP-Dmeye_2827102_r1_m1;size=6;*
CATGCCATTTTCTGTTCTGAACAACATTTCTGTTTTTGGTCAATTGCTCTCAATGTTATACATTTTGTTAGGTAAATGCTGACAAGCCAGAATCTTTAACTGGTTTGAGATTAAAAGATGTCGACTGGCCTTCTCCAAGTGATCAAATCCCATTTTGGAAGAGGGNNTT
>DIP-Dmeye_16027633_r1_m1;size=3;+
CATGCCATTTTCTGTTCTGAACAACATTTCTGTTTTTGGTCAATTGCTCTCAATGTTATACATTTTGTTAGGTAAATGCTGACAAGCCAGAATCTTTAACTGGTTTGAGATTAAAAGATGTCGACTGGCCTTCTCCAAGTGATCAAATCCCATTTTGGAAGAGGGANTT
>DIP-Dmeye_19449613_r1_m1;size=2;+
CATGCCATTTTCTGTTCTGAACAACATTTCTGTTTTTGGTCAATTGCTCTCAATGTTATACATTTTGTTAGGTAAATGCTGACAAGCCAGAATCTTTAACTGGTTTGAGATTAAAAGATGTCGACTGGCCTTCTCCAAGTGATCAAATCCCATTTTGGAAGAGGGNATT
>DIP-Dmeye_1063090_r1_m1;size=1;+
CATGCCATTTTCTGTTCTGAACAACATTTCTGTTTTTGGTCAATTGCTCTCAATGTTATACATTTTGTTAGGTAAATGCTGACAAGCCAGAATCTTTAACTGGTTTGAGATTAAAAGATGTCGACTGGCCTTCTCCAAGTGATCAAATCCCATTTTGGAAGAGGGNATN
Edgardo M. Ortiz
@edgardomortiz
Aug 29 2016 19:18
Great, will update as soon as the other jobs I am running are finished. Thanks!
Deren Eaton
@dereneaton
Aug 29 2016 19:18
One thing that might help your data set cluster faster, Edgardo, would be to use the edit_cutsites option, since it looks like you have a lot of uncalled bases in the second cut site overhang, i.e., the AATT at the far end of the reads. You can deal with this in two ways. One is to trim it off by setting a number, e.g., (4, 4), and the other is to edit the sequence, since you know what it's supposed to be for sure, e.g., (CATG, AATT).
It's just a way of reducing the number of unique sequences that need to be clustered, since a bunch will differ by a single N, and often Ns creep into the cut sites, probably because the Illumina software doesn't deal well with having a bunch of invariant sites at the beginning of reads.
That option is applied in step 2, so you would have to go back to that step.
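The corresponding params file entry could look roughly like one of these (the comment text here is just illustrative):
4, 4                           ## [25] [edit_cutsites]: trim this many bases from the ends of R1, R2
CATG, AATT                     ## [25] [edit_cutsites]: overwrite the cut-site overhangs on R1, R2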
I'll push the change in a minute, haven't done it yet.
Edgardo M. Ortiz
@edgardomortiz
Aug 29 2016 19:23
Thanks, I will try that. So my parameter 25 would look like CATGC, AATTC for SphI and EcoRI, right?
Deren Eaton
@dereneaton
Aug 29 2016 19:26
oh, yeah, that looks right.
Edgardo M. Ortiz
@edgardomortiz
Aug 29 2016 19:32
Cool, thanks Deren, that is a good way of allowing the Ns to be in sites outside the cut sites (I imagine more reads will pass the quality filter), and I needed to update the version anyway :)
Deren Eaton
@dereneaton
Aug 29 2016 19:35
ok v.0.3.36 is now up.
Edgardo M. Ortiz
@edgardomortiz
Aug 29 2016 21:16
I think something broke. There are no separators between clusters in the .clust.gz files anymore. I was testing with a ddRAD single-end dataset, though.
Deren Eaton
@dereneaton
Aug 29 2016 21:17
oh shit, guess I should have tested before pushing the change.
I just reverted it back to before the bug for now (v.0.3.36-4). If you or Isaac can find an easy fix go for it, I gotta run for now.