These are chat archives for dereneaton/ipyrad

22 Mar 2017
draheem
@draheem
Mar 22 2017 15:29
I am analysing single-end RADseq data (250 bp reads, R1 only). Given the length of the reads, is it advisable to increase parameters 22 (max_SNPs_locus) and 23 (max_indels_locus) above their default values?
Isaac Overcast
@isaacovercast
Mar 22 2017 16:27
@ewarschefsky_twitter RE: clustering time for reference assembly, actually denovo is slightly faster. Reference mapping should be more accurate, but there is a little bit of housekeeping associated with that part of the pipeline that denovo doesn't go through...
@draheem It kind of depends on your data. How many samples do you have? Is it population-level sampling or more phylogenetic in scale? Do you have an idea of the nucleotide diversity? For long reads I would definitely recommend checking a sample or two in FastQC to make sure the quality is okay, especially toward the 3' end.
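For reference, a minimal sketch of that FastQC check, assuming FastQC is installed and using placeholder file names:

mkdir -p fastqc_out
fastqc sample_A_R1_.fastq.gz -o fastqc_out

If the quality holds up and the loci are long, the corresponding lines of the params file could then be raised above their defaults, e.g. (values purely illustrative):

30, 30               ## [22] [max_SNPs_locus]: Max # SNPs per locus (R1, R2)
12, 12               ## [23] [max_Indels_locus]: Max # indels per locus (R1, R2)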
R2C2.lab
@R2C2_Lab_twitter
Mar 22 2017 17:17

@isaacovercast I used cutadapt to trim the raw reads and clean the adapters and I am having this error: 2017-03-22 17:13:28,019 pid=13170 [util.py] ERROR Exception in merge_pairs - ('Error in merge pairs:\n %s\n%s', ['/home/regina/miniconda2/lib/python2.7/site-packages/bin/vsearch-linux-x86_64', '--fastq_mergepairs', '/home/regina/19samples/19samples-denovo-symb-cutadapt_edits/L47-tmp-umap1.fastq', '--reverse', '/home/regina/19samples/19samples-denovo-symb-cutadapt_edits/L47-tmp-umap2.fastq', '--fastqout', '/home/regina/19samples/19samples-denovo-symb-cutadapt_edits/L47-refmap_derep.fastq', '--fastqout_notmerged_fwd', '/home/regina/19samples/19samples-denovo-symb-cutadapt_edits/tmpCCJmq2_nonmergedR1.fastq', '--fastqout_notmerged_rev', '/home/regina/19samples/19samples-denovo-symb-cutadapt_edits/tmptsXjYK_nonmergedR2.fastq', '--fasta_width', '0', '--fastq_minmergelen', '35', '--fastq_maxns', '5', '--fastq_minovlen', '20', '--fastq_maxdiffs', '4', '--label_suffix', '_m1', '--fastq_qmax', '1000', '--threads', '2', '--fastq_allowmergestagger'], 'vsearch v2.0.3_linux_x86_64, 62.9GB RAM, 20 cores\nhttps://github.com/torognes/vsearch\n\nMerging reads\n\nFatal error: More forward reads than reverse reads\n')
2017-03-22 17:15:13,147 pid=13075 [util.py] ERROR Error: ['/home/regina/miniconda2/lib/python2.7/site-packages/bin/vsearch-linux-x86_64', '--fastq_mergepairs', '/home/regina/19samples/19samples-denovo-symb-cutadapt_edits/MC-O1-tmp-umap1.fastq', '--reverse', '/home/regina/19samples/19samples-denovo-symb-cutadapt_edits/MC-O1-tmp-umap2.fastq', '--fastqout', '/home/regina/19samples/19samples-denovo-symb-cutadapt_edits/MC-O1-refmap_derep.fastq', '--fastqout_notmerged_fwd', '/home/regina/19samples/19samples-denovo-symb-cutadapt_edits/tmpifQXIc_nonmergedR1.fastq', '--fastqout_notmerged_rev', '/home/regina/19samples/19samples-denovo-symb-cutadapt_edits/tmp7pT1nh_nonmergedR2.fastq', '--fasta_width', '0', '--fastq_minmergelen', '35', '--fastq_maxns', '5', '--fastq_minovlen', '20', '--fastq_maxdiffs', '4', '--label_suffix', '_m1', '--fastq_qmax', '1000', '--threads', '2', '--fastq_allowmergestagger'] vsearch v2.0.3_linux_x86_64, 62.9GB RAM, 20 cores
https://github.com/torognes/vsearch

Merging reads

Fatal error: More forward reads than reverse reads

Is there any problem with cutadapt? (See the note after the log below.) Usually I use Trim Galore and have never had problems, but I tried cutadapt and it seemed more efficient for ezRAD PE reads.
I also just updated to ipyrad [v.0.6.10].
It is still doing the mapping, but the log file shows the above error.

regina@ccmar-r2c2-01:~/19samples⟫ ipyrad -p params-denovo-reference.txt -s 3 -c 16


ipyrad [v.0.6.10]

Interactive assembly and analysis of RAD-seq data

loading Assembly: 19samples-denovo-symb-cutadapt
from saved path: ~/19samples/19samples-denovo-symb-cutadapt.json
host compute node: [16 cores] on ccmar-r2c2-01

Step 3: Clustering/Mapping reads

*************************************************************
Indexing reference sequence with bwa. 
This only needs to be done once, and takes just a few minutes
************************************************************* 
Done indexing reference sequence

[####################] 100% dereplicating | 0:06:16
[################ ] 84% mapping | 0:39:04
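For context on the "More forward reads than reverse reads" error above: vsearch's --fastq_mergepairs requires the forward and reverse files to contain the same reads in the same order, so any trimming step that discards reads from R1 and R2 independently will desynchronize the pair files. A minimal sketch of running cutadapt in paired mode, which filters both mates together and keeps the read counts equal (adapter sequence and file names are placeholders):

cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o trimmed_R1.fastq -p trimmed_R2.fastq raw_R1.fastq raw_R2.fastq

Trimming R1 and R2 in separate cutadapt runs, by contrast, can leave unequal read counts, which is exactly the condition vsearch reports.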

Jenny Archibald
@jenarch
Mar 22 2017 18:44

@isaacovercast @dereneaton I have not managed to get ipyrad to work yet for my paired RADseq dataset. We've fixed several previous issues (with your help); now it seems to be stalling on step 6. It has hit our cluster's time limit a couple of times now without seeming to make any further progress. Here is the latest log:


ipyrad [v.0.6.10]

Interactive assembly and analysis of RAD-seq data

loading Assembly: m04c90
from saved path: /panfs/pfs.acf.ku.edu/scratch/jkarch/cam/ch1/m04c90.json
host compute node: [8 cores] on m008

Step 6: Clustering at 0.9 similarity across 288 samples
[####################] 100% concat/shuffle input | 0:00:04
[ ] 0% clustering across | 7 days, 0:00:05 =>> PBS: job killed: walltime 604823 exceeded limit 604800

We ran a preliminary analysis with about half of these individuals last year in pyrad (no "i") and it worked fine, so the data seem to be OK. I'd like to use ipyrad instead if possible, because of the improvements for dealing with paired-end data. Do you know what the problem might be? This analysis has been continued multiple times with different versions of ipyrad (as noted above, there were previous issues to fix, plus the walltime limits), so I did restart another analysis fresh from step 1 with the current version in case that was the problem. However, I am not sure how many weeks behind it is, so I would prefer to just move forward if possible. Any advice is appreciated!

Isaac Overcast
@isaacovercast
Mar 22 2017 18:58
@R2C2_Lab_twitter The problem isn't with cutadapt, it's internal to reference sequence mapping. Did step 3 finish? I thought I had it set up to catch these errors and just ignore them.
@R2C2_Lab_twitter So the assembly actually crashes?
Isaac Overcast
@isaacovercast
Mar 22 2017 21:47
@jenarch I'd be willing to bet it's still a simple walltime issue. If there's any way you can get more than 8 cores, that will probably help the cause considerably. 288 samples is a considerable number, and if they are PE that increases the size of the dataset further. I know a guy who is running 477 samples on 8 cores, and step 6 ran for 2 weeks! In general, more cores and more RAM will always help. If your cluster has MPI capability, you might try requesting several compute nodes and using the --MPI flag for ipyrad:
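A minimal sketch of such a submission under PBS (node counts, walltime, and the params file name are assumptions inferred from the log above):

#PBS -l nodes=4:ppn=8
#PBS -l walltime=168:00:00
ipyrad -p params-m04c90.txt -s 6 -c 32 --MPI

With --MPI, ipyrad can spread its worker engines across all the cores the scheduler allocates, rather than only those on a single node.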