These are chat archives for dereneaton/ipyrad

21 Feb 2018
Glib Mazepa
@mazepago_twitter
Feb 21 2018 10:38
@isaacovercast so I showed you the quota for $HOME (5 GB), where conda and ipyrad are installed, right? The program is executed in /scratch/, where I have plenty of space: my current quota there is 5 TB and 2,000,000 files... I can see just one possibility in this case: if it is a problem with $HOME space, I could install ipyrad in /scratch/ directly, right?
Eaton Lab
@eaton-lab
Feb 21 2018 16:39
@mazepago_twitter When step 6 finishes it will take up considerably less disk space; it just creates a lot of temporary files while it is running, but then removes them as each is consumed and stored into a single database file. That is the process we call "indexing clusters". It seems you are getting an error from running out of disk space, but if your project_dir is in scratch then ipyrad will only write output files to scratch.

The only exception is the ipyrad logfile, which will be written to your current directory, but this should not be a very large file. However, because the error is being caught by the logging module in Python, I'm guessing that you are running out of disk in your HOME directory as a consequence of ipyrad writing to the log file. If you are running with the -d flag then ipyrad may write a lot of information to this file, so try running without it. And if the log file is big right now you can delete it. Try opening up just a bit more space in HOME so you have more wiggle room to run things without being right on the edge of filling it up.
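For reference, the output location is controlled by the project_dir parameter. In the Python API the same idea looks something like this (a rough sketch; the assembly name and scratch path are placeholders for your own):

```python
import ipyrad as ip

# placeholder assembly name and scratch path -- substitute your own
data = ip.Assembly("myassembly")
data.set_params("project_dir", "/scratch/username/ipyrad_out")

# all large output files are written under project_dir;
# only the small logfile goes to the current working directory
data.run("6")
```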
Peter B. Pearman
@pbpearman
Feb 21 2018 17:39
I need to do post-ipyrad processing to do three things: 1. select random SNPs from all samples in an alignment, 2. select random SNPs from a subset of samples (keeping only SNPs that remain polymorphic), and 3. select all SNPs and samples, conditioned on the SNPs being present in an arbitrary sample. Is there an efficient approach to doing this kind of thing? I tried putting my 83 x 600,000 .u.snps.phy alignment into an R tibble but (surprise) it choked. If someone could point me to a solid strategy, even in very general terms, I could probably take it from there (in Python or R or??). By the way, I thought of starting with the .u.snps.phy file because it is a format I can use for RAxML and the SNPs are thinned to 1 per locus. I'm certainly open to a better approach, and there's no reason to (naively) re-invent the wheel if it isn't necessary.
Maybe there is an appropriate tool I don't know about yet :smile:
toczydlowski
@toczydlowski
Feb 21 2018 17:59
@dereneaton @isaacovercast Hello Deren and Isaac. I am running ipyrad 0.7.13 de novo with GBS data. I have param 11 [mindepth_statistical] and param 12 [mindepth_majrule] both set to 8, but I am getting hundreds of loci where all read depths are between 1 and 7 (extracted from the .vcf). For the sample locus I looked at directly in the .loci file, there are only 2 SNPs. One SNP is present in only a few individuals and only at low read depths; the other is in enough individuals to be included, but only 22 of the 261 individuals have a read depth above 8 for this SNP, and the max is 10. Thoughts? Thanks.
Deren Eaton
@dereneaton
Feb 21 2018 18:44
@pbpearman I can share some Python code with you to very efficiently sample SNPs from the database file the way you described, if you'd like. But, depending on what type of downstream analyses you plan to run, a much simpler option may be to use some of the tools we developed that have this type of functionality built in as part of the ipyrad-analysis toolkit. See examples for raxml, structure, bucky, bpp, treemix, abba-baba, etc. here: https://ipyrad.readthedocs.io/analysis.html
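The gist of the snps-file route is just to treat the alignment as a 2-D array and sample columns. A rough, untested sketch (it assumes a simple non-interleaved phylip layout like the .u.snps.phy output; the filenames are placeholders):

```python
import numpy as np

def load_phylip(phyfile):
    "Read a simple (non-interleaved) phylip file into names + a 2-D char array."
    with open(phyfile) as infile:
        ntax, nsites = (int(i) for i in infile.readline().split())
        names, seqs = [], []
        for line in infile:
            if line.strip():
                name, seq = line.split(None, 1)
                names.append(name)
                seqs.append(list(seq.strip()))
    return names, np.array(seqs)

def write_phylip(outfile, names, arr):
    "Write names + a 2-D char array back out in phylip format."
    with open(outfile, "w") as out:
        out.write("{} {}\n".format(arr.shape[0], arr.shape[1]))
        for name, row in zip(names, arr):
            out.write("{} {}\n".format(name, "".join(row)))

# sample 10000 random SNP columns without replacement
names, arr = load_phylip("mydata.u.snps.phy")   # placeholder filename
cols = np.random.choice(arr.shape[1], size=10000, replace=False)
write_phylip("subsample.phy", names, arr[:, np.sort(cols)])
```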
@toczydlowski the mindepth filters are applied at step 5; did you change their settings after running this step?
toczydlowski
@toczydlowski
Feb 21 2018 19:03
@dereneaton Not sure I follow. I set all of the values (including mindepths) in a param file and run the program on a cluster one step after the next from 1-7... So all param values are set once and do not change for steps 1-7.
Isaac Overcast
@isaacovercast
Feb 21 2018 19:13
@toczydlowski There are two different ways people speak about "depth", and it can be confusing. For ipyrad the mindepth settings apply to the number of reads clustered per locus (clustered within individuals at step 3). The practical value here is that locus depth corresponds with confidence in base calling (at step 5). Sometimes people also talk about "depth" as locus coverage among samples (the # of samples with data at a given SNP/locus). Can you paste in one line from your vcf that shows the behavior you're describing?
toczydlowski
@toczydlowski
Feb 21 2018 20:01
@isaacovercast @dereneaton Yes, I am talking about depth at a locus for a specific individual. Here is the beginning of a line from the vcf for the sample locus I referred to earlier:

locus_83564 87 . C T 13 PASS NS=261;DP=1034 GT:DP:CATG ./.:0:0,0,0,0 0/0:1:1,0,0,0 0/0:5:5,0,0,0 0/0:3:3,0,0,0 0/0:5:5,0,0,0 ./.:0:0,0,0,0 0/0:14:14,0,0,0 0/0:4:4,0,0,0

It is my understanding that to be included here all individuals should have a depth of at least 8 (based on my mindepths). One is 14, but there are also samples with depths of 1, 3, 4, and 5 in this small example. If all samples were at the minimum (unlikely), DP should be at least 8 x 261 = 2088, right?
Glib Mazepa
@mazepago_twitter
Feb 21 2018 20:46
@eaton-lab @isaacovercast thanks guys! I ran through step 6, but then step 7 got terminated: "ERROR All chunk dimensions must be positive (All chunk dimensions must be positive)". In the '_outputs' folder I have (although I specified * for output_formats):
18_Feb_bash_90_steps12345.hdf5
18_Feb_bash_90_steps12345.loci
18_Feb_bash_90_steps12345_stats.txt
18_Feb_bash_90_steps12345.vcf
tmp-18_Feb_bash_90_steps12345.h5
Peter B. Pearman
@pbpearman
Feb 21 2018 21:08
@dereneaton Yes, and yes! I would appreciate getting any scripts you are willing to share. I have two outgroups that need to be included in phylogenies, and they may share few SNPs with the ingroup. It would be great to be able to filter for all SNPs that are present in a particular small set of samples, then output an alignment file and/or merge it with a random sample of SNPs from the same data (many ways to do that). So any functionality along the lines of my first post would be very helpful. Maybe I can modify what you have already started. Also, I plan to do RAxML, BPP, and STRUCTURE runs, so the toolkit will be a great help.
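For the outgroup filter, something like this is what I'm imagining (rough and untested, treating the alignment as a 2-D array again; the sample names and missing-data codes are assumptions):

```python
import numpy as np

def filter_present_in(arr, names, required, missing=("N", "-")):
    """Keep only SNP columns where every sample in `required` has data.
    `arr` is the (ntaxa, nsites) char array and `names` the sample names,
    e.g. as loaded from the .u.snps.phy file (see the sketch above)."""
    idx = [names.index(s) for s in required]
    # drop any column where a required sample is missing ('N' or '-')
    keep = ~np.isin(arr[idx, :], missing).any(axis=0)
    return arr[:, keep]

# e.g., keep SNPs with data in both outgroups (hypothetical sample names)
filtered = filter_present_in(arr, names, ["outgroup1", "outgroup2"])
```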
Isaac Overcast
@isaacovercast
Feb 21 2018 21:40
@mazepago_twitter Can you rerun step 7 with the -d flag and post the last handful of lines of the ipyrad_log.txt?
Isaac Overcast
@isaacovercast
Feb 21 2018 21:47
@toczydlowski This could be related to this issue: #225.
My strong belief is that the filtering is right (mindepth is being honored), but that the reporting of depth in the vcf file is wonky.
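If you want to sanity-check a line yourself: the CATG field holds the per-base depths, so each sample's DP should equal the sum of its CATG counts. A quick untested sketch:

```python
def check_depths(vcfline):
    "Flag samples whose DP field disagrees with the sum of their CATG counts."
    fields = vcfline.split()
    fmt = fields[8].split(":")               # the FORMAT column, e.g. GT:DP:CATG
    dp_i, catg_i = fmt.index("DP"), fmt.index("CATG")
    for i, sample in enumerate(fields[9:]):
        parts = sample.split(":")
        dp = int(parts[dp_i])
        catg_sum = sum(int(x) for x in parts[catg_i].split(","))
        if dp != catg_sum:
            print("sample {}: DP={} but CATG sums to {}".format(i, dp, catg_sum))

# run it on (part of) the line pasted above
check_depths("locus_83564 87 . C T 13 PASS NS=261;DP=1034 "
             "GT:DP:CATG ./.:0:0,0,0,0 0/0:1:1,0,0,0 0/0:14:14,0,0,0")
```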
Ivan Prates
@ivanprates
Feb 21 2018 22:19
Hi @isaacovercast and @dereneaton, I'm trying to set different trim lengths (param 17, Min length of reads after adapter trim) with GBS in step 2. Yet, I get the exact same numbers in my s2_rawedit_stats.txt files when applying trim = 35, 70, 100, 130, and 200 (i.e., even when longer than the actual reads!). Am I missing something? Thanks!
Isaac Overcast
@isaacovercast
Feb 21 2018 22:38
Hey Ivan! Hm, that's weird, it should work. Let me look at it.
Isaac Overcast
@isaacovercast
Feb 21 2018 23:04
@ivanprates Ok, I found the problem and fixed it, but I'm having trouble pushing the new package to conda. The mac package is up, but the linux one is being goofy. I'll work on it.
toczydlowski
@toczydlowski
Feb 21 2018 23:05
@isaacovercast I just read through all of that issue. Thanks for pointing me there. Yes, I also noticed in the meantime that there are many mistakes in my .vcf file: for example, some individuals called 0/0 have CATG counts of 0,0,X,0 while other individuals at the same locus, also called 0/0, have counts of 0,X,0,0... in addition to things like heterozygous calls with a read depth of 1 according to the vcf. It isn't totally clear to me what I can trust from the .vcf file. Example vcf line:

locus_28448 12 . G A 13 PASS NS=315;DP=32652 GT:DP:CATG 0/0:106:0,0,0,106

It sounds like I can't use any of the depths. Is the rest of the info reliable? If you could highlight/bold the info that you know is safe in my example vcf line, that would be awesome. I am doing some inbreeding and site frequency spectrum work, so accurate genotype calls are essential for me; sometimes I extract and use A,T,G,C and sometimes 0/0, 1/0, etc. Is there a way to test that mindepth is truly being honored (or have you already done this)? And is there any way for me to get read depth info (besides the broken vcf)?
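Here's the kind of mindepth test I have in mind, for what it's worth (a rough, untested sketch; assumes the GT:DP:CATG format shown above, and the filename is a placeholder):

```python
def count_low_depth_calls(vcffile, mindepth=8):
    "Count called (non-missing) genotypes whose reported DP is below mindepth."
    low = total = 0
    with open(vcffile) as infile:
        for line in infile:
            if line.startswith("#"):
                continue                      # skip header lines
            fields = line.split()
            fmt = fields[8].split(":")
            gt_i, dp_i = fmt.index("GT"), fmt.index("DP")
            for sample in fields[9:]:
                parts = sample.split(":")
                if parts[gt_i] == "./.":
                    continue                  # missing call; mindepth doesn't apply
                total += 1
                if int(parts[dp_i]) < mindepth:
                    low += 1
    print("{} of {} called genotypes report DP < {}".format(low, total, mindepth))

count_low_depth_calls("mydata.vcf", mindepth=8)   # placeholder filename
```

Of course, if the reported depths themselves are wrong, this only tests the vcf reporting, not the actual filtering.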