These are chat archives for dereneaton/ipyrad

31st
Aug 2017
tommydevitt
@tommydevitt
Aug 31 2017 15:26
Hi @dereneaton and @isaacovercast , I'm trying to work through some of the ipyrad API analysis tools (e.g., the BPP Jupyter notebook), and am confused about connecting to an ipyparallel cluster running on my HPC. From the BPP notebook: "You will need to have an ipcluster instance running in a separate terminal on your machine (or ideally, it is running on your HPC cluster)." So, do I open a new terminal on my laptop, connect to my HPC, and then run ipcluster start --n 48 ? If that's the case, do I run the command from a login node, or do I need to connect to a compute node? Trying to wrap my head around running different jobs in parallel on my HPC.
Isaac Overcast
@isaacovercast
Aug 31 2017 18:25
@tommydevitt There is a nice page on our docs site that explains all about how to do this: http://ipyrad.readthedocs.io/HPC_Tunnel.html?highlight=hpc
check it out and let us know if you have any questions.
tommydevitt
@tommydevitt
Aug 31 2017 18:39
Thanks @isaacovercast . On that page, there is a link to ipyparallel tutorial__) that doesn't go anywhere; is there a separate tutorial for ipyparallel?
toczydlowski
@toczydlowski
Aug 31 2017 19:19
@dereneaton Hi Deren. Back to the SNP filtering issue. I opened the .phy file in Geneious, and there are N's where a locus didn't exist in an individual and no dashes (-). This seems fine. When I open the .snp.phy file, there are N's and also dashes (-) now. What are the dashes and where did they come from (why weren't they in the .phy file)? Sometimes some individuals have an N and others have -'s for the same position. What is the difference? This weirdness seems to correspond to the tails created by concatenating reads where ApeKI cut sites were close together (like we discussed). I think what is happening is that SNPs in these fringy tails are being included in the final dataset because the locus was in >75% of the individuals (my param no. 21), but the tails are only in a few individuals, and SNPs in those few tails at a locus are being called. So in the end .snp.phy has SNPs present in less than 75% of individuals. Does this seem correct, and are you aware of this behavior? A current issue for me is that I want to use the .vcf file, but the messy SNPs are included, and I don't know of a simple way to filter the .vcf file to match e.g. the .u.snp.phy file (which is the cleanest with regard to these low-representation SNPs, not surprisingly). I haven't confirmed yet whether everything in .u.snp.phy is in 75%+.
@dereneaton. A snapshot of what I describe above from .snp.phy in Geneious.
fringe ends gbs.png
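A minimal sketch of how the per-site coverage described above could be checked, assuming the SNP alignment has already been read into a dict of equal-length strings keyed by sample name. The function names and the toy data are hypothetical and not part of ipyrad; both `N` and `-` are counted as missing, and the 0.75 threshold simply mirrors the 75% figure mentioned in the message.

```python
def site_coverage(seqs):
    """Return, for each column, the fraction of samples with a called base."""
    missing = set("Nn-")  # assumption: N and - both count as "no data"
    nsamples = len(seqs)
    length = len(next(iter(seqs.values())))
    coverage = []
    for i in range(length):
        called = sum(1 for s in seqs.values() if s[i] not in missing)
        coverage.append(called / nsamples)
    return coverage


def low_coverage_sites(seqs, min_cov=0.75):
    """Return the 0-based column indices whose coverage is below min_cov."""
    return [i for i, c in enumerate(site_coverage(seqs)) if c < min_cov]


# Toy alignment (made up for illustration): columns 1 and 3 are half missing.
example = {
    "ind1": "ACGT",
    "ind2": "ANGT",
    "ind3": "A-GN",
    "ind4": "ACG-",
}
print(low_coverage_sites(example))
```

Columns flagged this way would be the candidates for SNPs that sit in the low-representation tails.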
Deren Eaton
@dereneaton
Aug 31 2017 19:29
@toczydlowski I see what you mean, the two characters are being treated a little ambiguously in some cases. An N can mean it is unknown either because (1) the locus is missing, or (2) the base call is statistically ambiguous; and a - can mean either (1) there is data for the locus and this site is an indel, or (2) this is in the flanking region of a locus with data of variable lengths among samples...
Maybe the flanking regions should be encoded as Ns...?
That way - would only mean that it is thought to be an indel, and N always means unknown. That seems reasonable.
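The recoding idea above can be sketched as follows. This is only an illustration under stated assumptions, not ipyrad's implementation: `recode_flanks` is a hypothetical helper that treats any run of `-` at the start or end of a sample's sequence as flanking missing data and rewrites it as `N`, while leaving internal `-` characters (putative indels) untouched.

```python
def recode_flanks(seq, missing="N", gap="-"):
    """Replace leading and trailing gap runs with the missing character."""
    core = seq.strip(gap)
    if not core:
        # The whole sequence is gaps, so treat it all as missing.
        return missing * len(seq)
    left = len(seq) - len(seq.lstrip(gap))    # length of the leading gap run
    right = len(seq) - len(seq.rstrip(gap))   # length of the trailing gap run
    return missing * left + core + missing * right


print(recode_flanks("--AC-GT---"))
```

One limitation of this sketch is that a true indel at the very edge of a locus cannot be told apart from a flank and would also be recoded as `N`.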
Isaac Overcast
@isaacovercast
Aug 31 2017 19:47
@tommydevitt I actually don't know where that is supposed to link to.... There's more info about running ipcluster on the other HPC docs page: http://ipyrad.readthedocs.io/HPC_script.html?highlight=HPC#optional-controlling-ipcluster-by-hand
Deren Eaton
@dereneaton
Aug 31 2017 19:48
Oh yeah, I never finished making the thing it was supposed to link to.
Was planning a more in depth tutorial on it.
James Clugston
@Cycadales_twitter
Aug 31 2017 21:11
@dereneaton @isaacovercast I just moved my data over to an AWS online instance and I am wondering if this looks OK with two samples. Should there be so many vsearch jobs running? This was the command I used: ipyrad -p params-Carm2.txt -c 64 -t 32 -s 3 -r >& Carm2 &
Screen Shot 2017-08-31 at 22.07.00.png
Screen Shot 2017-08-31 at 22.07.44.png
toczydlowski
@toczydlowski
Aug 31 2017 23:49
@dereneaton I agree, Deren. I checked the .u.snp.phy file and, of about 7,700 SNPs, none had more than 25% Ns. So yes, it seems to be an issue of stuff in those tails being dragged along and called as SNPs, but when the best SNP is picked for the .u.X files, all SNPs fit the filtering (although some -'s are just a product of tails and not true indels). I agree that the tails and true indels should be coded distinctly, and filling N's for the tails makes sense to me (since it's a product of the digest). Best case scenario in my mind would be to check that the cut site sequence exists at the end of 5' tails and the start of 3' tails before filling with Ns, though. Is there a reason we wouldn't want to check that the cut site exists, as just stated, and then just chop those tails off the final loci (instead of coding them as Ns), perhaps by running those loci through cutadapt again with the enzyme cut site specified as a 3' and 5' "adapter"? You'd have to be sure you were finding tails and not a cut site in the middle of a locus, though, obviously. Thoughts on filtering the contents of the .vcf file to match e.g. .u.snp.phy?
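One possible way to approach the .vcf question is to filter records by call rate. The sketch below is hedged: it assumes a standard tab-delimited VCF in which sample columns start at the tenth field, the genotype is the first colon-separated subfield, and missing genotypes contain `.` (e.g. `./.`). Whether the tail positions are written as missing in the ipyrad .vcf is not confirmed here. The function names and example records are hypothetical. Note that this filters on coverage only; it does not pick one SNP per locus the way the .u.snp.phy output does.

```python
def call_rate(vcf_line):
    """Fraction of samples with a non-missing genotype on one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    genotypes = [f.split(":")[0] for f in fields[9:]]
    if not genotypes:
        return 0.0
    called = sum(1 for g in genotypes if "." not in g)
    return called / len(genotypes)


def filter_vcf_lines(lines, min_rate=0.75):
    """Yield header lines unchanged and data lines with call rate >= min_rate."""
    for line in lines:
        if line.startswith("#") or call_rate(line) >= min_rate:
            yield line


# Made-up records for illustration: the second SNP is called in 1 of 4 samples.
example = [
    "##fileformat=VCFv4.0\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\ts3\ts4\n",
    "locus_1\t5\t.\tA\tG\t13\tPASS\t.\tGT\t0/0\t0/1\t1/1\t0/0\n",
    "locus_1\t90\t.\tC\tT\t13\tPASS\t.\tGT\t./.\t./.\t0/1\t./.\n",
]
kept = list(filter_vcf_lines(example))
print(len(kept))
```

To use it on a real file, the lines of an open file handle could be passed to `filter_vcf_lines` and the results written to a new file.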