These are chat archives for dereneaton/ipyrad

23rd Mar 2017
James Clugston
@Cycadales_twitter
Mar 23 2017 09:59
@isaacovercast @dereneaton Hi guys, I am getting a new problem during step six or seven, and it seems to have come with the latest update. A few weeks ago I completed a really big run with around 200 samples; it took around 8-10 days to complete on an 84-core cluster. However, the number of loci in the final output was really low. Included in the run were two species and a putative hybrid between them. I know nothing is wrong with my parameters file settings-wise. I also ran one of the species separately and got the same results! That was for nine populations of the same species plus material from cultivation. I have attached a couple of files, but I am rerunning one of the species now as I think something strange happened during step six. Any ideas? With the Carm dataset I have actually gotten far more loci previously, so it's quite strange.
R2C2.lab
@R2C2_Lab_twitter
Mar 23 2017 10:05
@isaacovercast this is the message I got
sample [Mdil12] failed. See error in ./ipyrad_log.txt
sample [MC-RB-5-YOB] failed. See error in ./ipyrad_log.txt
sample [MC-O1] failed. See error in ./ipyrad_log.txt
sample [MC-RB-4-YOWB] failed. See error in ./ipyrad_log.txt
sample [MC-RB-2-RB] failed. See error in ./ipyrad_log.txt
sample [mCAPsample2] failed. See error in ./ipyrad_log.txt
sample [MC-R1] failed. See error in ./ipyrad_log.txt
[####################] 100% clustering | 0:00:00
[####################] 100% building clusters | 0:00:08
[####################] 100% chunking | 0:00:00
[####################] 100% aligning | 0:00:00
[####################] 100% concatenating | 0:00:00
no clusters found for Mflabsample8
no clusters found for MC-RB-5-YOB
no clusters found for MC-RB-3-YOP
no clusters found for MC-RB-4-YOWB
no clusters found for mCAPsample2
no clusters found for Mdil12
no clusters found for MC-RB-2-RB
no clusters found for MC-O1
no clusters found for MC-R1
no clusters found for MC-RB-1-RPB
The problem seems to be in the clustering; it crashes in step 3.
R2C2.lab
@R2C2_Lab_twitter
Mar 23 2017 10:14
This is the log file from clustering
Isaac Overcast
@isaacovercast
Mar 23 2017 13:42
@R2C2_Lab_twitter Well, yeah, that doesn't look right at all. Can you post the results of ipyrad -r -p <your params file>?
Isaac Overcast
@isaacovercast
Mar 23 2017 13:51
@Cycadales_twitter Hey James, long time no see. Hope all's well. I see the two params and stats files you sent, but I don't exactly get what the problem is. Is the problem with Carm or Cycas?
Jenny Archibald
@jenarch
Mar 23 2017 13:55
@isaacovercast Thanks for the quick assistance! I've now continued the run with nodes=4:ppn=8, and it started fine at least. Any idea if that will be sufficient with a walltime of 168 h to make progress on step 6? There are more cores available that I may be able to use, but I'm trying to leave enough for other users. What was concerning me was that the log acted like it hadn't done anything in each week of analyses ("0% clustering"), so it's good to hear that maybe that's fixable with more cores and/or time.
Isaac Overcast
@isaacovercast
Mar 23 2017 13:56
The Carm stats don't look too bad; they look pretty normal. You can see that most loci are getting filtered by min_samples_locus and max_alleles, which is something we've seen before with your data. Your clustering threshold (0.80) is fairly permissive, so that could be the root of that problem. Cycas looks okay, except that the min samples per locus value is really high, so you're filtering out 99% of the loci. I'm sure there's a good reason for this that I don't understand.
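For reference, both of those knobs live in the params file; a minimal sketch with illustrative values (the param indices may differ slightly by ipyrad version):

0.85    ## [14] [clust_threshold]: Clustering threshold for de novo assembly
4       ## [21] [min_samples_locus]: Min # samples per locus for output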
@jenarch Well, I guess we'll see. Each cluster and each dataset have their own little quirks, so it's hard to extrapolate. I would say 168 hours should be enough, but yeah, ask me again in 169 hours :-p
If you look in the _clust_0.85 directory while it's in the clustering step you should see stuff changing. Let's see what happens when it gets there.
R2C2.lab
@R2C2_Lab_twitter
Mar 23 2017 16:00
@isaacovercast sure, here it is
Isaac Overcast
@isaacovercast
Mar 23 2017 16:06
@R2C2_Lab_twitter Oh sorry, I meant actually run this command and post the results:
ipyrad -p params-denovo-reference.txt -r
R2C2.lab
@R2C2_Lab_twitter
Mar 23 2017 16:06

2 regina@ccmar-r2c2-01:~/19samples⟫ ipyrad -p params-denovo-reference.txt -r

Summary stats of Assembly 19samples-denovo-symb-cutadapt

                 state  reads_raw  reads_passed_filter
MC-O1                2    1432643              1432630
MC-R1                2    1536740              1536725
MC-RB-1-RPB          2    1515309              1515301
MC-RB-2-RB           2    1343668              1343644
MC-RB-3-YOP          2    1346612              1346582
MC-RB-4-YOWB         2    1467096              1467070
MC-RB-5-YOB          2    1144086              1144056
Mdil12               2    3916872              3904320
Mflabsample8         2    4016508              4001547
mCAPsample2          2    4486464              4467914

Full stats files

step 1: ./19samples-denovo-symb-cutadapt_s1_demultiplex_stats.txt
step 2: ./19samples-denovo-symb-cutadapt_edits/s2_rawedit_stats.txt
step 3: ./19samples-denovo-symb-cutadapt_clust_0.85/s3_cluster_stats.txt
step 4: None
step 5: None
step 6: None
step 7: None

Apparently it didn't do any filtering.
Isaac Overcast
@isaacovercast
Mar 23 2017 16:17
Can I see the s3 file as well?
R2C2.lab
@R2C2_Lab_twitter
Mar 23 2017 16:18
Is this notation for the sequences correct? MC-O1_R1_.fq (it was not displaying properly here, but R1 is separated by two underscores: one after O1 and another after R1)
Isaac Overcast
@isaacovercast
Mar 23 2017 16:25
Yeah that's fine.
@R2C2_Lab_twitter How big is the raw data? Can you dropbox me a sample or two and the reference sequence?
R2C2.lab
@R2C2_Lab_twitter
Mar 23 2017 16:56
You mean the fastq.gz files from before running cutadapt, correct?
Deren Eaton
@dereneaton
Mar 23 2017 18:09
@jenarch, maybe try running step 6 without using MPI (e.g., run on a single 8-core node). It may be that MPI is not initializing correctly on your system, which would make things run super slow. It should very quickly start moving past 0% clustering if working properly. Also, make sure to request a reasonable amount of RAM, maybe like 4G per core (I use the SLURM argument #SBATCH --mem-per-cpu 4000). Let me know if that works and then we can try to troubleshoot the MPI mode after.
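For example, a minimal single-node SLURM script along those lines might look like this (the params file name and resource values are placeholders, not a definitive recipe):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --mem-per-cpu=4000
#SBATCH --time=168:00:00

## run step 6 on one node without MPI (params file name is illustrative)
ipyrad -p params-mydata.txt -s 6 -c 8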
Deren Eaton
@dereneaton
Mar 23 2017 18:22
@all and @ewarschefsky_twitter There have been a few posts recently about long running jobs (e.g., >150 hours), which in my experience should be quite rare when many processors are being used. In general, I would guess that libraries which take this long to run are probably overloaded with singleton reads, meaning reads are not clustering well within or across samples. This can happen for two main reasons: (1) Your data set actually consists of a ton of singleton reads, which is often the case in libraries that use very common cutters like ezRAD; or (2) Your data needs to be filtered better, because low quality ends and adapter contamination are causing the reads to not cluster.
@all I had a kind of 'messy' data set recently that had a lot of quality issues and was also taking a long time to cluster; after aggressive filtering, the assembly ran about 5X faster and recovered about 2X as much data in the end. Here are some ways to filter more aggressively:
(1) Set filter_adapters to 2 (stringent=trims Illumina adapters)
(2) Set phred_Qscore_offset to 43 (more aggressive trimming of low-quality bases from the 3' end of reads).
(3) Hard trim the first or last N bases from raw reads by setting e.g., trim_reads to (5, 5, 0, 0)
(4) Add additional 'adapter sequences' to be filtered (any contaminant can be searched for; I have added long A-repeats in one library where this appeared common). This can be done easily in the API, but requires editing the JSON file for the CLI; see the sketch after this list.
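As a rough sketch of (1)-(4) in the Python API, assuming an existing assembly JSON (the extra-adapters key in _hackersonly is an assumption and may differ by version; check the keys in your install):

import ipyrad as ip

## load an existing assembly (the JSON name here is illustrative)
data = ip.load_json("mydata.json")

## (1) stringent adapter filtering
data.set_params("filter_adapters", 2)

## (2) more aggressive trimming of low-quality 3' bases
data.set_params("phred_Qscore_offset", 43)

## (3) hard trim the first and last 5 bases of raw reads
data.set_params("trim_reads", (5, 5, 0, 0))

## (4) add an extra contaminant sequence (a long A-repeat);
## this key name is an assumption -- verify it in your version
data._hackersonly["p3_adapters_extra"] = ["AAAAAAAAAAAA"]

## re-run from the filtering step onward
data.run("234567", force=True)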
James Clugston
@Cycadales_twitter
Mar 23 2017 21:05
@isaacovercast well, I have not been on for a while since I had no problems until recently. It's more that I am getting exactly the same number of loci for both datasets, and it makes me a little unsure. I was actually filtering for a minimum of 60% of samples sharing each locus, for population genetics. Let me test with a different clustering threshold.
@dereneaton I had to do the same with my data, but I did this before ipyrad could really do it, as it seems a combination of ezRAD and NextSeq can get a little messy. I used Trimmomatic to clean the data up and also only retained paired reads. Do you think ipyrad could do a better job than Trimmomatic now?
Deren Eaton
@dereneaton
Mar 23 2017 21:13
@Cycadales_twitter ipyrad uses cutadapt, which should do just as good a job as Trimmomatic, but it really depends on which arguments you use with the program. If you ran it yourself and ensured that it was trimming everything you wanted trimmed, then that will probably be more accurate than ipyrad's method, since we use a fixed set of arguments that we expect to work generally across most data sets and provide options to change only a few of the major settings (e.g., quality cutoff, where to hard trim, which adapters to look for).
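For comparison, a standalone paired-end cutadapt call in that spirit might look like the sketch below (the adapter sequence, cutoffs, and file names are illustrative, not ipyrad's exact invocation):

cutadapt -q 20 -m 35 \
    -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
    -o trimmed_R1_.fastq.gz -p trimmed_R2_.fastq.gz \
    sample_R1_.fastq.gz sample_R2_.fastq.gz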
Deren Eaton
@dereneaton
Mar 23 2017 21:20
@Cycadales_twitter I'm not sure I understand what you mean by getting the exact same number of loci in both data sets. It looks like you have 178 loci in the Cycas data set and 11,100 in the Carm data set. It does seem like you are losing a ton of data to filtering though (max_indels and max_snps), which makes me suspicious that something is up.
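For reference, those two filters sit in the params file and take separate R1/R2 values for paired data; the numbers below are only illustrative, and the param indices may differ by version:

20, 20    ## [22] [max_SNPs_locus]: Max # SNPs per locus (R1, R2)
8, 8      ## [23] [max_Indels_locus]: Max # Indels per locus (R1, R2)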
Isaac Overcast
@isaacovercast
Mar 23 2017 21:41
@R2C2_Lab_twitter Yes, that's right, fastq.gz
James Clugston
@Cycadales_twitter
Mar 23 2017 22:19
@dereneaton well, I have not really played with max_indels or max_snps. Do you have any recommendations for population data? All I did was double the numbers, since I am using a 150 bp PE dataset, although I do not exactly know if that is the right thing to do here.