These are chat archives for dereneaton/ipyrad

25th
Mar 2018
Isaac Overcast
@isaacovercast
Mar 25 2018 15:08
@danielyao12 I sent you a more detailed email regarding these issues. The short version: the first warning can be safely ignored, and the second hdf5 error can be resolved by setting an environment variable like this: export HDF5_USE_FILE_LOCKING=FALSE
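A minimal sketch of applying that fix in a bash-like shell (the export line is from the message above; the echo is just a sanity check):

```shell
# Disable HDF5 file locking for the current shell session,
# then confirm the variable is set before re-running ipyrad.
export HDF5_USE_FILE_LOCKING=FALSE
echo "$HDF5_USE_FILE_LOCKING"
```

To persist the setting across sessions you could add the export line to your shell startup file (e.g. ~/.bashrc).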
Isaac Overcast
@isaacovercast
Mar 25 2018 15:20
@PaoloMomigliano_twitter Hey Paolo, Yes, I remember fixing this issue. rm_duplicates counts loci with multiple hits within one sample during the "clustering across" step. You can think of these as paralogs or pseudo-paralogs. The reliability of datasets produced with versions < 0.7.16 is not guaranteed. I would expect the extreme amount of rm_duplicate filtering to be highly biased, so I'd imagine you'd want to redo your assembly if you have the option. I would recommend rerunning from step 3 for the most accurate results. On the upside, you're going to get a LOT more loci/snps with the fixed version.
@bernnad Oops! Yeah, it's a very common bug called 'developer forgetfulness': we updated 0.7.23 for the mac package, but not for the linux package. I have resolved this by pushing 0.7.23 for linux, so you can try updating again and it'll work this time.
Isaac Overcast
@isaacovercast
Mar 25 2018 15:31
@Ollie_W_White_twitter Hm, that's totally weird, I really don't know what could be going on. If you want to try dropboxing me the R1 and R2 files I can look at it.
Step 1 first splits the huge raw data files into a bunch of more manageably sized files to parallelize the demux process across cores. This error is caused by the number of split files for R1 and R2 being different, which is just bizarre.
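One way to sanity-check that mismatch yourself is to count the split files for each read direction; this is a hedged sketch, since the glob patterns for the chunk files are illustrative, not ipyrad's exact naming:

```shell
# Count R1 vs R2 split files in the fastqs directory and flag a mismatch.
# The *_fastqs/*_R1_*.gz patterns are assumptions about the layout.
R1_COUNT=$(ls *_fastqs/*_R1_*.gz 2>/dev/null | wc -l)
R2_COUNT=$(ls *_fastqs/*_R2_*.gz 2>/dev/null | wc -l)
if [ "$R1_COUNT" -ne "$R2_COUNT" ]; then
    echo "Mismatch: $R1_COUNT R1 files vs $R2_COUNT R2 files"
fi
```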
Isaac Overcast
@isaacovercast
Mar 25 2018 15:37
@Ollie_W_White_twitter Also, can you show me an ls -l in the ipyrad working directory, as well as the *_fastqs directory inside the working directory? Also can you paste in the first 6 or 7 lines of your params file?
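For reference, the requested commands might look like this when run from the working directory; the params file name here follows the ipyrad default naming and may differ for your assembly:

```shell
ls -l                     # contents of the ipyrad working directory
ls -l *_fastqs/           # contents of the fastqs directory
head -n 7 params-*.txt    # first lines of the params file
```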
Isaac Overcast
@isaacovercast
Mar 25 2018 15:43
@emhudson Well, the min_samples filter is pretty reliable. This could be caused by a couple of things. If you actually have no loci with sample depth greater than 1 this would happen: if your clustering threshold is too high it'll just oversplit everything, though this should still retain loci that are monomorphic across samples. The other thing that could be doing it is if all your loci are shorter than the filter_min_trim_len, so double check that your value here isn't too high. We can take a look at some of the clusters found during step 6 to see what they look like. If you look in your working directory there should be a directory that looks like <your_assembly>_across, and inside this directory is a file that ends with _catclust.gz. Can you execute the following command and email me the output?
gunzip -c *_across/*_catclust.gz | head -n 50
You have to cd to your working directory for this to work.
@joqb v.0.7.23 is now up on conda for linux.
@cwessinger What version of ipyrad are you running? This sounds like a malformed pop assignments file. Can you email me the file you're using so I can take a look at it? I'll PM you my address.
Isaac Overcast
@isaacovercast
Mar 25 2018 15:54
@cwessinger Did you create a branch and filter out some samples after step 6?