These are chat archives for dereneaton/ipyrad

8th
Jan 2018
Isaac Overcast
@isaacovercast
Jan 08 2018 00:34 UTC
@nitishnarula WOW! That is a big dataset. Assuming you are on the current version of ipyrad, step 6 has pretty good checkpointing. The checkpointing only applies at the end of each substep (each progress bar you see for the different parts of the step). If you're clustering across now and you kill it, it will start back over at the beginning of this substep, skipping the concat/shuffle step. Clustering across takes a ton of time. If you can throw more cores at it, that would help. You can also try experimenting with the -t flag to give the clustering step more multithreading. Setting -t to 2 or 4 should help speed things up.
Also, more RAM, as much as you can get. If you run out of RAM and the clustering algorithm starts paging out this'll slow you WAY down.
Saritonia
@Saritonia
Jan 08 2018 10:50 UTC
Hi again Deren and Isaac. Like Nitish Narula, my ipyrad run is going to be cancelled by the scheduler on the cluster in a few days due to time limits. In my case, I am running step 3. When I used pyRAD, I reran this step separately for only the samples that had not finished within-sample clustering, and then added those samples to the clust.XX folder. Can I do the same with ipyrad? Will it affect the stats files or any of the following steps? I would also like to ask how many nodes and cores you recommend for running 300 samples in a reasonable time. Thank you in advance!!
Siadjeu Christian
@siadjeu2012_twitter
Jan 08 2018 13:08 UTC
Hi @dereneaton and @isaacovercast, I am new here and I used ipyrad for GBS data. I have several questions. I used TreeMix and tetrad to build my trees. However, when I use the imap option, TreeMix and tetrad are not able to separate the outgroup species from the other individuals, and when I remove the imap option they separate my outgroup species very well. Why? Also, when I run TreeMix to look at the historical relationships among individuals within the species I study, it tells me that 155 SNPs are written, but when I include the outgroup species only 2 SNPs are written, and TreeMix can't separate the outgroup. I want to know why. How do I put the migration legend on the TreeMix plot? I would also like to know how to perform a PCA with GBS data from a de novo assembly using an ipyrad Jupyter notebook.
I forgot, thanks in advance!!
Siadjeu Christian
@siadjeu2012_twitter
Jan 08 2018 13:14 UTC
I would also like to know how to cite the ipyrad pipeline correctly when using it. Thanks again in advance!!
jeremycandersen
@jeremycandersen
Jan 08 2018 19:05 UTC

@dereneaton @isaacovercast I'm new to ipyrad, so my apologies in advance for this question. I'm having a problem similar to the one @letimm asked about on Feb 28 2017: during step 1, the "chunking large files" substep ends successfully, but the "sorting reads" substep stays at 0% for several days. I'm not using the -d flag, which seemed to be the solution for @letimm, so I was wondering what else I'm doing wrong. Here's a bit from the screen:
ipyrad [v.0.7.19]

Interactive assembly and analysis of RAD-seq data

New Assembly: YSTtest
establishing parallel connection:
host compute node: [12 cores] on n0000.vector0

Step 1: Demultiplexing fastq data to Samples
[####################] 100% chunking large files | 1:29:14
[ ] 0% sorting reads | 21:11:11

I have PE reads (each file is ~17 GB zipped), and I have 68 individuals with multiplexed barcodes. Here is an example of the formatting:
B02-0816-05 agctga tcagct
B02-0816-06 agctga gacact
B02-0816-07 agctga gagcat
B02-0816-08 agctga agtctg
B02-0820-01 agctga catcag
B02-0820-02 agctga tctagc
B02-0820-03 agctga gtgtga
B02-0820-04 agctga tcgtga
B02-0820-05 cactag tcagct
B02-0821-03 cactag gacact

Because the samples are multiplexed, I'm using:
pair3rad ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.

Many thanks again.
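
A barcodes table like the one above can be sanity-checked before starting a long demultiplexing run. The sketch below is plain Python and does not use ipyrad. The function names `parse_barcodes` and `find_problems` are hypothetical, and the three-column layout (sample name, first barcode, second barcode) is assumed from the lines pasted in this message. It only looks for duplicated sample names, duplicated barcode pairs, and non-ACGT characters; it is not a diagnosis of the stalled "sorting reads" substep.

```python
from collections import Counter

# Barcode lines copied from the message above: sample name, then two barcodes.
BARCODES_TEXT = """\
B02-0816-05 agctga tcagct
B02-0816-06 agctga gacact
B02-0816-07 agctga gagcat
B02-0816-08 agctga agtctg
B02-0820-01 agctga catcag
B02-0820-02 agctga tctagc
B02-0820-03 agctga gtgtga
B02-0820-04 agctga tcgtga
B02-0820-05 cactag tcagct
B02-0821-03 cactag gacact
"""


def parse_barcodes(text):
    """Parse whitespace-separated lines into (sample, barcode1, barcode2) tuples."""
    rows = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 columns, got {len(fields)}")
        sample, bc1, bc2 = fields
        # Upper-case the barcodes so comparisons are case-insensitive.
        rows.append((sample, bc1.upper(), bc2.upper()))
    return rows


def find_problems(rows):
    """Return a list of human-readable problems found in the parsed rows."""
    problems = []
    name_counts = Counter(sample for sample, _, _ in rows)
    pair_counts = Counter((bc1, bc2) for _, bc1, bc2 in rows)
    for name, n in name_counts.items():
        if n > 1:
            problems.append(f"sample name used {n} times: {name}")
    for (bc1, bc2), n in pair_counts.items():
        if n > 1:
            problems.append(f"barcode pair used {n} times: {bc1} {bc2}")
    for sample, bc1, bc2 in rows:
        for bc in (bc1, bc2):
            if set(bc) - set("ACGT"):
                problems.append(f"non-ACGT character in barcode for {sample}: {bc}")
    return problems


rows = parse_barcodes(BARCODES_TEXT)
problems = find_problems(rows)
print(f"{len(rows)} samples parsed, {len(problems)} problems found")
for p in problems:
    print(" -", p)
```

An empty problem list only means the table is internally consistent; it says nothing about whether the barcodes match what is actually in the reads.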

Nitish Narula
@nitishnarula
Jan 08 2018 20:20 UTC
@isaacovercast Thanks for the info. The job was cancelled when the clustering across substep was at 81%. I'll have to restart it. Previously I had given the job 48 cores, and I think the RAM was 800 GB (not sure if I remember correctly). Should I try more? I didn't change the -t flag. The log says threads was set to 2. For the restart, should I try 4 or even 8? I know that for some programs excessive parallelization doesn't help. Finally, I noticed from the job output that in the last week or so, the progress for this substep increased much faster than in the initial days. Is that normal?
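
The trade-off behind the -t question can be made concrete with a little arithmetic. The sketch below assumes, as a simplification not confirmed anywhere in this chat, that each threaded job occupies -t cores, so roughly cores // threads jobs can run at the same time. The helper `concurrent_jobs` is hypothetical and is not part of ipyrad.

```python
def concurrent_jobs(cores, threads):
    """Rough number of jobs that fit at once if each job takes `threads` cores.

    This is an assumed model (cores // threads), not ipyrad's actual scheduler.
    """
    if cores < 1 or threads < 1:
        raise ValueError("cores and threads must both be at least 1")
    return max(1, cores // threads)


CORES = 48  # the core count mentioned in the message above

# Compare the thread settings discussed in the chat: 2, 4 and 8.
table = {t: concurrent_jobs(CORES, t) for t in (2, 4, 8)}
for t, jobs in table.items():
    print(f"-t {t}: about {jobs} jobs at once on {CORES} cores")
```

Under this model, raising -t gives each job more threads but lets fewer jobs run side by side, so whether 4 or 8 beats 2 depends on how well the individual clustering job scales with threads, which would have to be measured on the actual data.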