These are chat archives for dereneaton/ipyrad

Oct 2017
Oct 25 2017 08:18
Hi, I'm getting the same error as @jebberson related to the file locking system at the end of step 5 (IOError(Unable to create file (File locking disabled on this file system (use hdf5_use_file_locking environment variable to override), errno = 38, error message = 'function not implemented'))). I have talked with my cluster admin and, since implementing file locking on our Lustre partitions is not an option, he suggested setting the HDF5_USE_FILE_LOCKING variable to FALSE, assuming that no other process will try to write to the HDF5 file. I tried this approach and it worked (1 node - 12 cpus). But could this trick affect the integrity of the outputs? Thanks!
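A minimal sketch of the workaround described above, assuming it is applied from Python rather than in the job script. The variable name HDF5_USE_FILE_LOCKING comes from the error message; setting it before h5py or ipyrad is imported is an assumption made here as the conservative choice, since it guarantees the HDF5 library sees the value when it opens files.

```python
import os

# Assumption: this runs before h5py (or ipyrad, which uses HDF5 files)
# is imported, so the HDF5 library sees the setting when it opens a file.
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

# Child processes started from this session inherit the variable as well.
print("HDF5_USE_FILE_LOCKING =", os.environ["HDF5_USE_FILE_LOCKING"])
```

With locking disabled, HDF5 no longer guards against a second writer, so this rests on the same assumption the cluster admin made: only one process writes to a given HDF5 file at a time.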
Ollie White
Oct 25 2017 08:42
Cheers @isaacovercast that makes more sense regarding the mapped reads
Ollie White
Oct 25 2017 13:54

I am trying to run D-statistics on my own data, but it seems to be taking much longer than I would expect based on the example APIs. For just one test tree it has been running for an hour... Has anyone seen a similar running time, or can anyone suggest a possible problem?


Robin K Bagley
Oct 25 2017 14:31
@nspope Thanks for the awk tips! Worked like a charm.
Isaac Overcast
Oct 25 2017 19:08
@toczydlowski The max_alleles_consens is the maximum number of unique alleles allowed in (individual) consens reads. Different individuals can be biallelic at a site for different bases, e.g. A/G vs A/T, which would give you this output in your vcf file (one ref and two alt alleles). Does this make sense?
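A small sketch of the A/G vs A/T case, in plain Python rather than ipyrad code. The sample names, the reference base, and the limit of 2 are hypothetical values chosen for illustration. It shows that each individual stays within a per-individual allele limit while the site as a whole still carries two alternate alleles.

```python
# Hypothetical genotypes at one site for two diploid individuals.
genotypes = {"sample_1": ("A", "G"), "sample_2": ("A", "T")}
reference_base = "A"      # assumed reference allele at this site
max_alleles_consens = 2   # assumed per-individual limit for this example

# Unique alleles per individual: each one is biallelic here.
per_individual = {name: len(set(alleles)) for name, alleles in genotypes.items()}

# Unique alleles across all individuals at the site.
site_alleles = set().union(*genotypes.values())
alt_alleles = sorted(site_alleles - {reference_base})

print(per_individual)   # number of unique alleles in each individual
print(alt_alleles)      # alternate alleles that would be listed for the site
```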
Isaac Overcast
Oct 25 2017 19:21
@vsoza "Could you explain why more loci/sample are recovered in step 3 in ipyrad 0.7.15 versus 0.5.15 with a reference assembly? Do you also know why more loci are filtered_by_rm_duplicates in step 7 in ipyrad 0.7.15 versus 0.5.15 with a reference assembly?" The difference in recovery of loci between these two versions is probably just because version 0.5.15 is really old and we've made lots of improvements; I can't say exactly what's happening without looking at the data. The filtered_by_rm_duplicates issue is much more concerning. 72% filtered seems like there's a real issue. What is the reference sequence you're using?
Oct 25 2017 21:12
Thanks for the response @isaacovercast. Ok, I will assume that version 0.7.15 is working properly for step 3. However, yes, the difference in loci filtered_by_rm_duplicates in step 7 between the 2 versions is disconcerting. I am using a reference genome for the reference sequence. I used the same reference genome in ipyrad versions 0.5.15 and 0.7.15 with the same RADseq dataset and am definitely getting very different results in step 7 for filtered_by_rm_duplicates. Version 0.7.15 is filtering out 244733 loci as filtered_by_rm_duplicates from 339863 total_prefiltered_loci. Version 0.5.15 is filtering out 8143 loci as filtered_by_rm_duplicates from 145793 total_prefiltered_loci. Let me know if there are any files I can send you to help troubleshoot. Thanks.
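For reference, the filtered fractions behind those counts can be worked out with a few lines of Python. The counts are the ones quoted above; the variable names are only illustrative.

```python
# (filtered_by_rm_duplicates, total_prefiltered_loci) per ipyrad version,
# taken from the step 7 numbers quoted in the message.
counts = {
    "0.7.15": (244733, 339863),
    "0.5.15": (8143, 145793),
}

# Fraction of prefiltered loci removed by the duplicates filter.
fractions = {version: filtered / total for version, (filtered, total) in counts.items()}

for version, fraction in fractions.items():
    print(f"ipyrad {version}: {fraction:.1%} of loci filtered_by_rm_duplicates")
```

That is roughly 72% of loci in 0.7.15 against about 5.6% in 0.5.15.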
Isaac Overcast
Oct 25 2017 22:14
That does seem very weird. Can you dropbox me a couple of the sample fastq files and the reference sequence you're using? I'll try to check it out.