These are chat archives for dereneaton/ipyrad

24th
Oct 2017
ChaoShenzjs
@ChaoShenzjs
Oct 24 2017 02:40
@isaacovercast Thank you, I have found the reason why I failed at demultiplexing the dataset. In the barcodes file I used "number + barcode sequence" at first, e.g. "1 ATCAG", but when I changed the format to something like "BC-1 ATCAT" it ran well. I guess a bare number cannot be recognized as a character string. Also, I sequenced more than 600 individuals (divided into 13 pools) after a ddRAD library preparation. Now I want to demultiplex the 13 datasets in Step 1 separately, and then pool all individuals into one directory to run Steps 2 to 7, although I know it will be a very big dataset! What do you think? Waiting for your suggestion!
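In case it helps anyone else, a minimal working barcodes file (tab-separated; the sample names and sequences below are made-up examples, not my real ones) would look like:

BC-1	ATCAT
BC-2	CCGTA
BC-3	GGAAT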
Katherine Silliman
@ksil91
Oct 24 2017 14:04

@dereneaton @isaacovercast I am trying to make a reproducible notebook with the API. I made one using an older version of ipyrad but wanted to do one with the newest version and the prettier API functions. I keep having issues with Step 1 hanging, though. It will get to 90% and then stop sorting (verified with ls -l), but in the notebook the timer keeps ticking. In the working directory there is an empty tmp-chunks folder and a _fastqs folder which looks like this:

 $ ls -l
total 30567288
-rw------- 1 ksilliman ksilliman      43839 Oct 24 08:26 tmp_2850_0.p
-rw------- 1 ksilliman ksilliman      44131 Oct 24 08:26 tmp_2885_0.p
-rw------- 1 ksilliman ksilliman  133223598 Oct 24 08:26 tmp_BC4_21_C1_R1_2850.fastq
-rw------- 1 ksilliman ksilliman  135597114 Oct 24 08:26 tmp_BC4_21_C1_R1_2885.fastq
-rw------- 1 ksilliman ksilliman   12970128 Oct 24 08:26 tmp_CA1_14_C1_R1_2850.fastq
-rw------- 1 ksilliman ksilliman   13119443 Oct 24 08:26 tmp_CA1_14_C1_R1_2885.fastq
-rw------- 1 ksilliman ksilliman    2289555 Oct 24 08:26 tmp_CA1_16_C1_R1_2850.fastq
-rw------- 1 ksilliman ksilliman    2376121 Oct 24 08:26 tmp_CA1_16_C1_R1_2885.fastq
-rw------- 1 ksilliman ksilliman   77945173 Oct 24 08:26 tmp_CA1_17_C1_R1_2850.fastq
-rw------- 1 ksilliman ksilliman   81380428 Oct 24 08:26 tmp_CA1_17_C1_R1_2885.fastq
-rw------- 1 ksilliman ksilliman  164063097 Oct 24 08:26 tmp_CA1_20_C1_R1_2850.fastq
-rw------- 1 ksilliman ksilliman  166411565 Oct 24 08:26 tmp_CA1_20_C1_R1_2885.fastq
-rw------- 1 ksilliman ksilliman  328791537 Oct 24 08:26 tmp_CA1_21_C1_R1_2850.fastq

etc.

I submitted the notebook to my cluster as a single-node, 8-core job:

XDG_RUNTIME_DIR=""
ipnport=$(shuf -i8000-9999 -n1)
ipnip=$(hostname -i)

echo -e "
    Copy/Paste this in your local terminal to ssh tunnel with remote
    -----------------------------------------------------------------
    ssh -N -L $ipnport:$ipnip:$ipnport name@host
    -----------------------------------------------------------------
    Then open a browser on your local machine to the following address
    ------------------------------------------------------------------
    localhost:$ipnport  (prefix w/ https:// if using password)
    ------------------------------------------------------------------
    "
jupyter-notebook --no-browser --port=$ipnport --ip=$ipnip

and then once I connected to the notebook on my local machine, I opened a terminal from the Jupyter home page and started ipcluster. The notebook shows that it is connected to the ipcluster instance.
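For context, the Step 1 call in my notebook is essentially the following (a sketch; the assembly name and file paths are placeholders for my real ones):

import ipyrad as ip
import ipyparallel as ipp

# connect to the ipcluster instance started from the Jupyter terminal
ipyclient = ipp.Client()
print(len(ipyclient))  # should report the 8 engines

# placeholder assembly; the real one points at my raw fastqs and barcodes
data = ip.Assembly("demux_test")
data.set_params("raw_fastq_path", "raws/*.fastq.gz")
data.set_params("barcodes_path", "barcodes.txt")

# Step 1: demultiplex on the connected cluster
data.run("1", ipyclient=ipyclient)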
Any thoughts?

Isaac Overcast
@isaacovercast
Oct 24 2017 17:26
@ChaoShenzjs Wow, that's a big dataset. I hope you have LOTS of RAM. :+1: Good luck, and let me know how it goes!
vsoza
@vsoza
Oct 24 2017 18:43
@dereneaton @isaacovercast I have been analyzing some new data with ipyrad version 0.7.15 and have noticed that, compared to previous analyses I have done, a large percentage of loci are filtered_by_rm_duplicates in a reference assembly versus a denovo assembly of the same dataset. I thought this could be a version-specific issue, so I re-analyzed a dataset that I had run in ipyrad 0.5.15, using the same parameters and a reference assembly, in ipyrad 0.7.15. I am noticing 2 large discrepancies between the 2 versions. (1) More loci are recovered per sample in step 3 in ipyrad 0.7.15 (clusters_total = 80,168, clusters_hidepth = 57,863) versus ipyrad 0.5.15 (clusters_total = 30,809, clusters_hidepth = 15,483). (2) A higher percentage of loci are filtered_by_rm_duplicates in step 7 in ipyrad 0.7.15 (72%) versus ipyrad 0.5.15 (6%). Could you explain why more loci per sample are recovered in step 3 in ipyrad 0.7.15 versus 0.5.15 with a reference assembly? Do you also know why more loci are filtered_by_rm_duplicates in step 7 in ipyrad 0.7.15 versus 0.5.15 with a reference assembly? Would you expect these 2 discrepancies to also appear in a denovo assembly of the same dataset between the 2 versions? Thanks.
Robin K Bagley
@rkbagley_twitter
Oct 24 2017 19:13
Hi all, I see a ticket was closed for potentially introducing the capability to produce a VCF file calling one SNP per locus, once it was discovered that the vcftools --thin option will technically sample the first SNP per locus. I can do this as a start, or convert the .u.str file to what I need, but I was wondering if there is any intent to formally introduce this feature? It would be really nice to have!
nspope
@nspope
Oct 24 2017 21:29

@rkbagley_twitter as an aside ... you don't need vcftools for this, since this sort of simple text processing can be done with core bash/GNU utilities -- for example, to keep only the first SNP per contig with awk:

awk '/^#/ !/^#/ && !seen[$1]++' my.vcf > my.thinned.vcf

or to first randomize the VCF (so as to get a single random SNP per contig):

awk '/^#/; !/^#/ {print|"shuf"}' my.vcf | awk '/^#/; !/^#/ && !seen[$1]++' > my.thinned.vcf
toczydlowski
@toczydlowski
Oct 24 2017 22:17
@dereneaton @isaacovercast Possible bug with loci filtering! I just switched from 0.5.15 to 0.7.13 and am noticing similar behavior with a denovo assembly, @vsoza. I also came on to report that the ambiguity translation seems to be correct now (in 0.5.15, IUPAC ambiguity codes were being translated incorrectly in the VCF file, so instead of 2 alleles with heterozygotes I got 3 alleles at a position). However, I found at least one SNP in my dataset where there are 3 different alleles at a position, even though I have [18] [max_alleles_consens] set to 2. This is the only SNP in that locus. What gives?? This is in v0.7.13.
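For reference, this is the setting I mean, as it appears in a standard params.txt (the value is what I have it set to; the description text here is approximate):

2                       ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences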
tommydevitt
@tommydevitt
Oct 24 2017 22:24

@isaacovercast @dereneaton I'll try to be more specific. I'm working through my BPP jupyter notebook (following the tutorial example) and did

b.run(
    nreps=4, 
    ipyclient=ipyclient, 
    randomize_order=True,
    force=True
    )

which returned

submitted 4 bpp jobs [test_r3] (10 loci)

then

b.files

which returned

data        /work/ipyrad/BPP_Sept3_outfiles/BPP_Sept3.alleles.loci
mcmcfiles   ['/work/ipyrad/analysis-bpp/test.mcmc.txt', '/work/ipyrad/analysis-bpp/test_r0.mcmc.txt', '/work/ipyrad/analysis-bpp/test_r1.mcmc.txt', '/work/ipyrad/analysis-bpp/test_r2.mcmc.txt', '/work/ipyrad/analysis-bpp/test_r3.mcmc.txt']
outfiles    ['/work/02745/ipyrad/analysis-bpp/test.out.txt', '/work/ipyrad/analysis-bpp/test_r0.out.txt', '/work/ipyrad/analysis-bpp/test_r1.out.txt', '/work/ipyrad/analysis-bpp/test_r2.out.txt', '/work/ipyrad/analysis-bpp/test_r3.out.txt']

I checked the directory, though, and none of the outfiles are there. So bpp never ran, even though when I check the async objects from the bpp object, they say the jobs finished. Why aren't the jobs running? I have an ipcluster instance running and the engines are working successfully.
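This is roughly how I'm checking the job handles (a sketch, assuming the submitted jobs are stored on the bpp object as b.asyncs, as in the tutorial notebook):

# poll each submitted job handle (an ipyparallel AsyncResult)
for job in b.asyncs:
    print(job.ready())  # True once the engine reports the job done
    if job.ready():
        try:
            job.get()   # re-raises any remote exception locally
        except Exception as err:
            print(err)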