Tanisha Williams
@tmw216
@isaacovercast I had no problem with the tutorial data (is this the simulated data?)
Isaac Overcast
@isaacovercast
@tmw216 Yes, it's simulated data. Here is the important line:
/dev/disk2s2 3.6Ti 3.5Ti 104Gi 98% 1125683 4293841596 0% /Volumes/Thunderbolt
The directory you're working out of only has ~100GB of free space, which is almost certainly being consumed by step 1.
Can you free up space on that drive? Or move to the G-DRIVEUSB, though you'll still want to keep an eye on disk usage there.
Good luck!
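For reference, free space can also be checked from Python before kicking off step 1 (a stdlib sketch; point it at your own project directory):

```python
import shutil

# Point this at the project/working directory (e.g. the Thunderbolt
# volume from the df output above); "." is just a placeholder.
usage = shutil.disk_usage(".")
print(f"free: {usage.free / 1e9:.1f} GB of {usage.total / 1e9:.1f} GB total")
```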
Tanisha Williams
@tmw216
@isaacovercast I am moving over to the G-DRIVEUSB now. Thank you so much for all your help!!
Isaac Overcast
@isaacovercast
+1
Tanisha Williams
@tmw216
@isaacovercast Quick Update - everything was resolved and the assembly is working fine. Thank you again for all your help.
Isaac Overcast
@isaacovercast
@tmw216 :thumbsup: Glad to hear it. Thanks for letting me know.
Yexiaying
@yexiaying
image.png
@isaacovercast Hi Isaac, I am running the bpp analysis tool but encountered an error. I ran this analysis following the ipyrad documentation: https://ipyrad.readthedocs.io/analysis.html
Isaac Overcast
@isaacovercast
@yexiaying Yep, that's a bug in bpp.py. I fixed it in bdfd0f4..e4d2116. You can fix it in your notebook by importing Params from the util module like this:
from ipyrad.analysis.utils import Params
Bradley Martin
@btmartin721
@isaacovercast Hi. I'm wondering if I can pull out specific loci from the .loci file based on the VCF file. Do the loci numbers in the .loci file line up with those in the .vcf file?
Yexiaying
@yexiaying
@isaacovercast I have imported the Params class, and help(Params) does print information, but the same error is still produced when running bpp.
image.png
Isaac Overcast
@isaacovercast
@btmartin721 Yes, the locus numbers agree between the vcf and the loci file.
@yexiaying You will either need to checkout the most recent version of the code from the repository, or wait until we push a new conda package.
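The locus matching @btmartin721 asked about could be sketched like this (hypothetical helpers; assumes the VCF CHROM column holds bare locus numbers and that each .loci block closes with a `//` separator line ending in `|<id>|` -- adjust the parsing if your CHROM values carry a prefix):

```python
def vcf_locus_ids(vcf_path):
    """Collect the set of locus identifiers from a VCF's CHROM column."""
    with open(vcf_path) as f:
        return {line.split("\t", 1)[0] for line in f
                if not line.startswith("#")}

def extract_loci(loci_path, wanted):
    """Yield .loci blocks whose closing '//...|<id>|' tag is in wanted."""
    block = []
    with open(loci_path) as f:
        for line in f:
            block.append(line)
            if line.startswith("//"):
                # the locus id sits between the last two pipe characters
                locus_id = line.rsplit("|", 2)[-2]
                if locus_id in wanted:
                    yield "".join(block)
                block = []
```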
tim-oconnor
@tim-oconnor

Hi @isaacovercast,

I'm trying to interpret a disparity between two parts of the *stats.txt file in the *_output directory produced by ipyrad v0.9.13. Here's a summary at the top:

                            total_filters  applied_order  retained_loci
total_prefiltered_loci                  0              0         157483
filtered_by_rm_duplicates            8365           8365         149118
filtered_by_max_indels              10508          10508         138610
filtered_by_max_SNPs                10248           1621         136989
filtered_by_max_shared_het            262             19         136970
filtered_by_min_sample             123882         119546          17424
total_filtered_loci                153265         140059          17424

And then in the locus coverage summary it looks like there's only 1125 loci:

     locus_coverage  sum_coverage
1                 0             0
2                 0             0
3                 0             0
4                 0             0
...

536               0          1125
537               0          1125

Are there additional filters being applied to the data that are not summarized at the head of the *stats.txt file? I'd certainly take another 16000 loci if I could get them. Or is there something amiss in the summary at the top of the file, and I in fact have 1125 loci in the final dataset? Thanks for any help!

Here's the params file if it's helpful:
------- ipyrad params file (v.0.9.13)-------------------------------------------
lig-denovo-c90                 ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
/global/scratch/toconnor/radcap/combo ## [1] [project_dir]: Project dir (made in curdir if not present)
Merged: lig1-s12, lig2-s12, lig3-s12, lig4-s12, lig5-s12, lig6-s12 ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
Merged: lig1-s12, lig2-s12, lig3-s12, lig4-s12, lig5-s12, lig6-s12 ## [3] [barcodes_path]: Location of barcodes file
Merged: lig1-s12, lig2-s12, lig3-s12, lig4-s12, lig5-s12, lig6-s12 ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo                         ## [5] [assembly_method]: Assembly method (denovo, reference)
                               ## [6] [reference_sequence]: Location of reference sequence file
pair3rad                       ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
TAATT, ATGCA                   ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
5                              ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33                             ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6                              ## [11] [mindepth_statistical]: Min depth for statistical base calling
6                              ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000                          ## [13] [maxdepth]: Max cluster depth within samples
0.9                            ## [14] [clust_threshold]: Clustering threshold for de novo assembly
1                              ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
2                              ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
80                             ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2                              ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
0.05                           ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus (R1, R2)
0.05                           ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus (R1, R2)
100                             ## [21] [min_samples_locus]: Min # samples per locus for output
0.2                            ## [22] [max_SNPs_locus]: Max # SNPs per locus (R1, R2)
6                            ## [23] [max_Indels_locus]: Max # of indels per locus (R1, R2)
0.5                            ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus
0, 0, 0, 0                     ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0                     ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
l, p, s, n, k, a, g, G, u, v, t, m ## [27] [output_formats]: Output formats (see docs)
                               ## [28] [pop_assign_file]: Path to population assignment file
                               ## [29] [reference_as_filter]: Reads mapped to this reference are removed in step 3
Isaac Overcast
@isaacovercast
Hm, that does look a little funny. What does it look like if you do this:
grep "|" /global/scratch/toconnor/radcap/combo/lig-denovo-c90_outfiles/lig-denovo-c90.loci | wc -l
This should give an accurate count of the actual number of loci in the final dataset.
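The same count can be done in Python (a sketch; assumes each locus block in the .loci file ends with a `//` separator line, which is the line carrying the pipe characters that the grep matches):

```python
def count_loci(loci_path):
    """Count loci in a .loci file by counting the '//' separator lines
    that close each locus block."""
    with open(loci_path) as f:
        return sum(1 for line in f if line.startswith("//"))
```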
tim-oconnor
@tim-oconnor
@isaacovercast This confirms I do indeed have 1125 loci in the dataset. I guess I asked my question a little imprecisely, though -- do you think there should be 1125 based on the filtering specified in the params file, or is it possible there should be ~17K? Is there any way to peek at exactly what happened in the filtering stages to make sure it went as anticipated? Now that there's no .log file I wasn't sure where to look. Thanks!
Isaac Overcast
@isaacovercast
@tim-oconnor Give me a few minutes. Those numbers should agree, so I'm not sure what's happening.
tim-oconnor
@tim-oconnor
Alrighty, thanks!
tim-oconnor
@tim-oconnor
@isaacovercast Just a note that I previously observed some incongruous numbers like this when I had set max_SNPs_locus (which now accepts a proportion) to an integer (like in v0.7.x). Perhaps there's something else malformed in my params?
Yexiaying
@yexiaying
@isaacovercast Thanks for your information, when will the new version be released?
VTonzo
@VaninaTonzo
Hi @isaacovercast, I have been using your easySFS script to obtain the SFS from my ipyrad .vcf. Everything works just fine, but I am wondering what the MSFS.obs numbers in the third row of the file mean. I understand that if I have 4 samples (diploids) and my sfs looks like this: 2476 2954 3501 592 894 155 0 0, would the first number (2476) be the number of sites identical to the reference, or the number of sites with missing data? And would the second number be the number of sites with one variant, the third the number of sites with two variants, and so on? Thanks in advance as always!!!
Isaac Overcast
@isaacovercast
@VaninaTonzo The numbers in the third row are the values of each bin of the multi-sfs. The first bin is the number of monomorphic sites. Your interpretation of the 2nd and 3rd bins is basically right. Missing data can't be included in the SFS (which is the reason for easySFS and the projection strategy).
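That binning can be sketched roughly like this (hypothetical helper; an unfolded SFS built from per-site derived-allele counts, with no missing data):

```python
from collections import Counter

def unfolded_sfs(derived_counts, n_chromosomes):
    """Bin i = number of sites where the derived allele occurs on exactly
    i of n_chromosomes sampled chromosomes; bin 0 is the monomorphic class.
    Sites with missing data cannot be binned, which is why easySFS
    projects down to a smaller sample size instead."""
    tally = Counter(derived_counts)
    return [tally.get(i, 0) for i in range(n_chromosomes + 1)]

# 4 diploid samples = 8 chromosomes -> 9 bins (0..8)
sfs = unfolded_sfs([0, 0, 1, 1, 2, 3, 8], 8)
```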
Isaac Overcast
@isaacovercast
@yexiaying Within the next week or two.
tim-oconnor
@tim-oconnor

Hi @isaacovercast -- some more followup.

I was scanning through the .loci file and noticed that although I set max_Indels_locus to 6, it looked like I had at most 1 or 2 (and usually no) indels in the loci that were written out.

So I increased max_Indels_locus to 20 and found that the header of the stats file now agreed with information reported farther down:

                            total_filters  applied_order  retained_loci
total_prefiltered_loci                  0              0         157483
filtered_by_rm_duplicates            9071           9071         148412
filtered_by_max_indels                718            718         147694
filtered_by_max_SNPs                10997          10364         137330
filtered_by_max_shared_het            287             34         137296
filtered_by_min_sample             139008         134314           2982
total_filtered_loci                160081         154501           2982
     locus_coverage  sum_coverage
1                 0             0
2                 0             0
...
534               0          2982
535               0          2982
536               0          2982
537               0          2982

Confusingly, when I changed max_Indels_locus back to 6, I get the same behavior I mentioned before: many more loci reported in the header than are in the final assembly. If I bump max_Indels_locus up to 100 I get more loci, but some of them look pretty fishy (as expected).

Not sure if that helps -- perhaps I am misinterpreting or misusing the max_Indels_locus parameter.

Isaac Overcast
@isaacovercast
@tim-oconnor Hmmmm, that is curious behavior. Thanks for the additional info. Let me look into it; maybe it's something to do with the PE (paired-end) data?
tim-oconnor
@tim-oconnor
Could be -- that would be an easy and (in retrospect) obvious fix. Thanks, I'll try it now and report back. I did start my assemblies with two values before, but I found that only one value was retained in the params file after merging my datasets. I interpreted this as ipyrad correcting me, but I guess a single value lurking in one of the pre-merge params files could have caused the post-merge file to have a single number. Anyway, more soon.
VTonzo
@VaninaTonzo
@isaacovercast Thanks Isaac!!!
tim-oconnor
@tim-oconnor
@isaacovercast Unfortunately I get the same behavior when I set max_Indels_locus to "6, 6" as when it is set to just "6".
Isaac Overcast
@isaacovercast
Ok, thx. I have 3RAD data that I'm trying to replicate this behavior on.
Isaac Overcast
@isaacovercast
@tim-oconnor Lol, this explains that behavior ;p
        self.maxinds = self.data.params.max_Indels_locus
        if isinstance(self.maxinds, tuple): 
            self.maxinds = self.maxinds[0]  # backwards compatibility
Only cares about the first value.
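A hypothetical fix sketch would keep both values rather than silently dropping the second (helper name and shape are my invention, not ipyrad's actual code):

```python
def normalize_max_indels(value):
    """Accept either a single int (old-style params files) or an
    (R1, R2) tuple, and always return a per-read tuple so the second
    value is never silently dropped."""
    if isinstance(value, tuple):
        return value
    return (value, value)
```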
Isaac Overcast
@isaacovercast
@tim-oconnor in your _across directory there's a file that looks like this: _clust_database.fa. Can you wetransfer or dropbox me that file?
tim-oconnor
@tim-oconnor
@isaacovercast Welp, that'll do it. The curious thing to me is why "6" is a magic number that breaks the filtering output. I've changed the max_Indels_locus to 8, 20, 100, and some other values and they all work as expected (even if they're not a good idea).
Isaac Overcast
@isaacovercast
Bizarre. did you try smaller numbers?
tim-oconnor
@tim-oconnor
Just ran filtering with max indels set to 2, and this also broke the filter reporting. The database is heading to you on wetransfer in a sec
tim-oconnor
@tim-oconnor
In any case it's clear now that this is just a nuisance rather than a real problem affecting the final data. The number of loci appearing in the outfiles is what I would expect given the filtering, even when the reporting in the stats file is off.
Isaac Overcast
@isaacovercast
@tim-oconnor Hey, can you send me your .json file for that assembly? It would be useful for debugging.
tim-oconnor
@tim-oconnor
En route
Isaac Overcast
@isaacovercast
@tim-oconnor Thx. dereneaton/ipyrad#358
ajbarley
@ajbarley

Hi all, a couple questions:

1) I am aligning my RAD data to a reference genome that is divergent from my study group and using the ‘denovo+reference’ option. I noticed that none of the output files contain the sequences or snp calls for the reference genome. Is there an option to have ipyrad include that information in the output files? Or do you have any suggestions about the easiest way for me to get that information out of ipyrad?

2) With respect to the vcf file output format, what are the variant calls in the ‘REF’ column? I'm not sure they are from the reference genome, because they always seem to be present in the file whether or not the locus aligned to a reference. Also, in the ‘CHROM’ column of the vcf, all the variants are assigned a locus "number". Is there an easy way to figure out the association between these locus numbers and the scaffolds they map to on the reference genome?

Thanks for your help!

Isaac Overcast
@isaacovercast
@tim-oconnor Wow, this was a NASTY bug. I fixed it in 79352a2..46f2708. See #358 for details.
Thanks for hanging in there with me.
tim-oconnor
@tim-oconnor
@isaacovercast Happy to be of service! I am guaranteed to keep breaking stuff, and I appreciate your help.