These are chat archives for dereneaton/ipyrad

May 2018
May 04 2018 03:10
Hi! I'm trying to filter ddRAD plastid reads into separate files so I can analyse them independently of the nuclear data. Thanks for adding the option to assemble to a reference genome in ipyrad; previously I was using my own pipeline, which included BWA-MEM, and feeding the output into pyrad. I'm using a reference plastid genome generated from one of the samples in my data set. I've manually checked the final ipyrad output against a couple of different reference genomes I have, and the results are a bit hit and miss; about half of the loci are similar to, but not from, the plastid genome. It looks like the mapping tool used in ipyrad finds a short seed sequence in a read that matches the reference and calls this a hit, but ignores the similarity/dissimilarity of the rest of the read (similar to BWA-MEM). Is there any way to increase the stringency of this step in the ipyrad parameters? I know that in Nucmer (part of MUMmer) you can specify a proportion of sequence similarity across a proportion of the read length when mapping to a reference, but I haven't been able to figure out how to weave this into ipyrad yet.
Jeronymo Dalapicolla
May 04 2018 14:40

Hello, I'm Jeronymo and I have a question about an error message in steps 6 and 7.

I'm using 238 samples and step 6 was taking too long to run. I was able to parallelize this step on my university cluster (HPC/Flux). However, when the step reached 100% after 2 days of running (96 GB on 24 cores, the maximum for me as a student), the job didn't stop; it kept running until the time limit I had set, 4 days. I received a notification that the job was aborted for running out of time. I had previously run a test analysis with a few samples to learn the program and the command lines, and everything went well. I compared the outputs from the two runs (the test and this aborted run that reached 100%) and apparently they are OK. Some co-workers told me that sometimes one of the nodes gets stuck and the job cannot finish. They have had this problem before.

I thought everything was OK, so I ran step 7, also parallelized (same cores and memory), but at the end of "100% filtering loci" the following error occurred:

[####################] 100% filtering loci | 2:06:49 ERROR:ipyrad.assemble.write_outfiles:error in filter_stacks on chunk 0: EngineError(Engine '87ce68b4-541eb86efaa3078ad3c9b103' died while running task u'7cc66d3a-ae75c472ed85ca7ab9176dc3')
ERROR:ipyrad.core.assembly:IPyradWarningExit: error in filter_stacks on chunk 0: EngineError(Engine '87ce68b4-541eb86efaa3078ad3c9b103' died while running task u'7cc66d3a-ae75c472ed85ca7ab9176dc3')

Encountered an error (see details in ./ipyrad_log.txt)
Error summary is below -------------------------------
error in filter_stacks on chunk 0: EngineError(Engine '87ce68b4-541eb86efaa3078ad3c9b103' died while running task u'7cc66d3a-ae75c472ed85ca7ab9176dc3')

On the internet, this error, "error in filter_stacks on chunk 0", is associated with a popfile, but I am not using any popfile. I reran step 7 without parallelizing, with 12 GB, and got a similar error, "error in filter_stacks on chunk 5386", at the "0% writing VCF" step. Maybe the memory wasn't enough.

I didn't have any problems running step 7 in the test. I think there's a chance that the files generated in step 6 are corrupted. Should I run step 6 again? Or does this message point to another problem, such as the parallelization being set up wrongly or a lack of memory? Step 6 takes 2 days to run, so before I rerun it I would like to know if someone could help me. Thanks for your time

Isaac Overcast
May 04 2018 16:55
@Gazza007 In fact we use BWA-MEM internally, so you're in luck. If you look in the .json file of your assembly you'll see a key called _hackersonly, which is a "dictionary" of hidden parameters, one of which is bwa_args. You can update this parameter with whatever arguments you'd like to specify for bwa and ipyrad will pass these through. Is that kind of what you were looking for?
@jdalapicolla If step 6 was killed before it completed then there's a strong chance the output files are corrupted. It would be good to re-run step 6 and include the -d flag to generate debug output to the ipyrad_log.txt file. Also, what substep of step 6 reached 100% and never completed?
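
For the bwa_args suggestion above, one way to make the edit is with a few lines of Python rather than by hand. This is only a minimal sketch: the helper `set_hidden_param` is hypothetical (not part of ipyrad), the dictionary below is a made-up stand-in for an assembly .json file, and the exact nesting of `_hackersonly` inside the real file is an assumption, which is why the helper searches for the key instead of hard-coding a path. The values `-k 25 -T 60` are purely illustrative; in `bwa mem`, `-k` is the minimum seed length and `-T` is the minimum alignment score to output.

```python
import json


def set_hidden_param(node, name, value, section="_hackersonly"):
    """Set `name` to `value` inside every dict stored under the key `section`.

    Walks nested dicts/lists so the exact location of `section` does not
    need to be known. Returns the number of sections that were updated.
    """
    updated = 0
    if isinstance(node, dict):
        for key, child in node.items():
            if key == section and isinstance(child, dict):
                child[name] = value
                updated += 1
            else:
                updated += set_hidden_param(child, name, value, section)
    elif isinstance(node, list):
        for child in node:
            updated += set_hidden_param(child, name, value, section)
    return updated


# Hypothetical stand-in for the parsed contents of an assembly .json file.
assembly = {"assembly": {"name": "plastid", "_hackersonly": {"bwa_args": ""}}}

# Illustrative bwa mem arguments only, not a recommendation.
n_updated = set_hidden_param(assembly, "bwa_args", "-k 25 -T 60")
print(json.dumps(assembly, indent=2))
```

To apply this to a real file, the dictionary would come from `json.load` and be written back with `json.dump`; keeping a backup copy of the original .json first seems prudent.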
Jeronymo Dalapicolla
May 04 2018 18:30

@isaacovercast Thanks for the reply. The substep in step 6 is "building database"; going by the tutorial, I think that is the last one in this step.

I think I found the solution. I was setting the "min_samples_locus" parameter to a number above 200 to avoid too much missing data (I have 238 samples). But my samples are from a rodent genus with great interspecific divergence in Cyt b. I set it to 150 samples, did not parallelize, and it worked: step 7 ran without errors. However, the number of loci was low, 5,000. I think step 7 did not find any loci shared across samples when the threshold was 200. I will subsample again from step 2 and run step 6 with the -d flag as you suggested, removing more samples with a low number of reads (I allowed 100,000 reads per sample; I read in the forum that the ideal is 300,000 or more). But I guess the problem was not in step 6 but in my "min_samples_locus" parameter. Thanks again for your time!

Isaac Overcast
May 04 2018 20:56
@jdalapicolla Yes, min_samples_locus will get you every time if you set it too high, especially if there are lots of divergent populations. Glad you got it sorted out.
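
The effect of min_samples_locus discussed above can be illustrated with a toy calculation. This is a rough sketch, not ipyrad's actual filtering code: the function `loci_retained` and the coverage numbers are hypothetical, and the only assumption is that a locus is kept when at least min_samples_locus samples have data for it.

```python
def loci_retained(samples_per_locus, min_samples_locus):
    """Count loci that have data for at least `min_samples_locus` samples."""
    return sum(1 for n in samples_per_locus if n >= min_samples_locus)


# Made-up number of samples with data at each of 10 loci (238 samples total).
coverage = [238, 210, 205, 180, 160, 155, 150, 120, 90, 40]

# Number of loci kept at three different thresholds.
retained = {t: loci_retained(coverage, t) for t in (100, 150, 200)}
print(retained)  # {100: 8, 150: 7, 200: 3}
```

With divergent taxa, few loci are recovered in nearly every sample, so the count of retained loci can fall sharply as the threshold approaches the total number of samples.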