These are chat archives for dereneaton/ipyrad

11th
Apr 2018
James Clugston
@Cycadales_twitter
Apr 11 2018 07:33

@isaacovercast @dereneaton Hi guys, I am having a problem with some ezRAD data produced on a HiSeq 4000 with a 150 bp PE run. Everything is fine except that we get an error during clustering.

```
Reading file /mnt/4TB/Beadman/Encephalartos/Data1/Encep1_edits/171109_K00166_0298_AHMFNTBBXX_8_TP-D7-004_TP-D5-002PairedTrim_derep.fastq 100%
796865010 nt in 3212558 seqs, min 40, max 264, avg 248
Masking 100%
Counting unique k-mers 100%
Clustering
2018-04-10 21:44:32,854 pid=6033 [assembly.py] ERROR IPyradError(cmd ['/home/ubuntu/miniconda2/lib/python2.7/site-packages/bin/vsearch-linux-x86_64', '-cluster_smallmem', '/mnt/4TB/Beadman/Encephalartos/Data1/Encep1_edits/171109_K00166_0298_AHMFNTBBXX_8_TP-D7-008_TP-D5-002PairedTrim_derep.fastq', '-strand', 'both', '-query_cov', '0.75', '-id', '0.85', '-minsl', '0.75', '-userout', '/mnt/4TB/Beadman/Encephalartos/Data1/Encep1_clust_0.85/171109_K00166_0298_AHMFNTBBXX_8_TP-D7-008_TP-D5-002PairedTrim.utemp', '-userfields', 'query+target+id+gaps+qstrand+qcov', '-maxaccepts', '1', '-maxrejects', '0', '-threads', '2', '-notmatched', '/mnt/4TB/Beadman/Encephalartos/Data1/Encep1_clust_0.85/171109_K00166_0298_AHMFNTBBXX_8_TP-D7-008_TP-D5-002PairedTrim.htemp', '-fasta_width', '0', '-fastq_qmax', '100', '-fulldp', '-usersort']: vsearch v2.0.3_linux_x86_64, 137.5GB RAM, 72 cores
https://github.com/torognes/vsearch

Reading file /mnt/4TB/Beadman/Encephalartos/Data1/Encep1_edits/171109_K00166_0298_AHMFNTBBXX_8_TP-D7-008_TP-D5-002PairedTrim_derep.fastq 100%
703224084 nt in 2977037 seqs, min 40, max 264, avg 236
Masking 100%
Counting unique k-mers 100%
Clustering)
2018-04-10 21:44:34,149 pid=6033 [assembly.py] ERROR shutdown warning: [Errno 3] No such process
```

danielyao12
@danielyao12
Apr 11 2018 09:51
@isaacovercast Thanks for your time. You've really helped me a lot. I will try your suggestion later and reply in a few days.
Ollie White
@Ollie_W_White_twitter
Apr 11 2018 13:53

Hi @isaacovercast, thanks again for helping with my earlier issue with de-multiplexing my paired-end data. Annoyingly, I've hit another issue at the clustering step. I think I have a reasonable number of reads and clusters per sample, but most (and in some cases all) are removed by the minimum depth requirement. See the step 3 output file below:

cat des-c80_clust_0.8/s3_cluster_stats.txt
         clusters_total  hidepth_min clusters_hidepth avg_depth_total avg_depth_mj avg_depth_stat sd_depth_total sd_depth_mj sd_depth_stat filtered_bad_align
art_B82          138299          6.0             1518            1.35         7.00           7.00           0.94        1.40          1.40                  0
art_B83           86239          6.0              764            1.33         7.10           7.10           0.90        1.55          1.55                  0
bour_51v         212037          6.0                0            1.12          nan            nan           0.37         nan           nan                  0
bour_563         125072          6.0              483            1.23         6.82           6.82           0.69        1.25          1.25                  0
bour_GH          104039          6.0              860            1.38         6.75           6.75           0.91        1.14          1.14                  0
dep_C26          119796          6.0             2049            1.42         7.30           7.30           1.12        1.67          1.67                  0
gil_131a          76606          6.0               37            1.19         6.70           6.70           0.52        1.01          1.01                  0
gil_B163         148712          6.0             1288            1.29         7.19           7.19           0.87        1.55          1.55                  0
gon_B162         126336          6.0              801            1.28         6.90           6.90           0.80        1.28          1.28                  0
gon_GHA           68186          6.0              174            1.24         6.39           6.39           0.66        0.55          0.55                  0
lem_98a          188671          6.0                0            1.11          nan            nan           0.35         nan           nan                  0
lem_98b          221011          6.0                0            1.17          nan            nan           0.49         nan           nan                  0
lem_B157         202154          6.0                0            1.04          nan            nan           0.20         nan           nan                  0
mil_125a         193275          6.0                0            1.26          nan            nan           0.66         nan           nan                  0
mil_128b         192133          6.0             2045            1.30         7.37           7.37           0.91        1.72          1.72                  0
mil_94v           86637          6.0               92            1.21         6.30           6.30           0.58        0.58          0.58                  0
mil_GHA          103316          6.0                0            1.19          nan            nan           0.46         nan           nan                  0
pre_B120         113170          6.0                0            1.17          nan            nan           0.44         nan           nan                  0
pre_GHA          102020          6.0                0            1.28          nan            nan           0.65         nan           nan                  0
tan_C6            98087          6.0                0            1.35          nan            nan           0.77         nan           nan                  0
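The pattern in the table can be quantified directly: for each sample, the fraction of clusters surviving the depth filter is `clusters_hidepth / clusters_total`. A minimal sketch, assuming only that the stats file is whitespace-separated with the header shown above (the function name is illustrative, not part of ipyrad):

```python
# Hedged sketch: fraction of clusters passing the min-depth filter per sample,
# parsed from the text of an ipyrad s3_cluster_stats.txt file.

def hidepth_fraction(stats_text):
    """Return {sample_name: clusters_hidepth / clusters_total}."""
    lines = stats_text.strip().splitlines()
    header = lines[0].split()
    i_total = header.index("clusters_total")
    i_hi = header.index("clusters_hidepth")
    out = {}
    for line in lines[1:]:
        fields = line.split()
        # data rows start with the sample name, which the header lacks,
        # so header indices are shifted by one
        out[fields[0]] = int(fields[1 + i_hi]) / int(fields[1 + i_total])
    return out
```

Applied to the table above, this gives roughly 1% for the best samples (e.g. art_B82: 1518/138299) and 0% for the worst, which matches the "most if not all removed" description.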

I have assembled the forward reads in isolation in the past and had no issues with the results, so I'm not sure what's happening here. Any suggestions or parameters to try would be much appreciated. The params file I am using is below:

cat params-des-c80.txt
------- ipyrad params file (v.0.7.23)-------------------------------------------
des-c80                        ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
/scratch/oww1v14/des.GBS.ipyrad ## [1] [project_dir]: Project dir (made in curdir if not present)
/scratch/oww1v14/des.GBS.ipyrad/MAC_RAW_GBS00289_L7_R* ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
/scratch/oww1v14/des.GBS.ipyrad/barcodes.txt ## [3] [barcodes_path]: Location of barcodes file
                               ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo-reference               ## [5] [assembly_method]: Assembly method (denovo, reference, denovo+reference, denovo-reference)
/scratch/oww1v14/des.GBS.ipyrad/reference-plastid-genomes/reference-plastid-genome.fasta ## [6] [reference_sequence]: Location of reference sequence file
pairddrad                      ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
TA, GCGC                       ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
5                              ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33                             ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6                              ## [11] [mindepth_statistical]: Min depth for statistical base calling
6                              ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000                          ## [13] [maxdepth]: Max cluster depth within samples
0.80                           ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0                              ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
2                              ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35                             ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2                              ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
5, 5                           ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus (R1, R2)
8, 8                           ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus (R1, R2)
4                              ## [21] [min_samples_locus]: Min # samples per locus for output
20, 20                         ## [22] [max_SNPs_locus]: Max # SNPs per locus (R1, R2)
8, 8                           ## [23] [max_Indels_locus]: Max # of indels per locus (R1, R2)
0.2                            ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus (R1, R2)
0, 0, 0, 0                     ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0                     ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
G, a, g, k, m, l, n, p, s, u, t, v ## [27] [output_formats]: Output formats (see docs)

Cheers, Ollie

Isaac Overcast
@isaacovercast
Apr 11 2018 15:38
@Cycadales_twitter Is it just this one sample? Did you try just running this command by hand?
'/home/ubuntu/miniconda2/lib/python2.7/site-packages/bin/vsearch-linux-x86_64', '-cluster_smallmem', '/mnt/4TB/Beadman/Encephalartos/Data1/Encep1_edits/171109_K00166_0298_AHMFNTBBXX_8_TP-D7-008_TP-D5-002PairedTrim_derep.fastq', '-strand', 'both', '-query_cov', '0.75', '-id', '0.85', '-minsl', '0.75', '-userout', '/mnt/4TB/Beadman/Encephalartos/Data1/Encep1_clust_0.85/171109_K00166_0298_AHMFNTBBXX_8_TP-D7-008_TP-D5-002PairedTrim.utemp', '-userfields', 'query+target+id+gaps+qstrand+qcov', '-maxaccepts', '1', '-maxrejects', '0', '-threads', '2', '-notmatched', '/mnt/4TB/Beadman/Encephalartos/Data1/Encep1_clust_0.85/171109_K00166_0298_AHMFNTBBXX_8_TP-D7-008_TP-D5-002PairedTrim.htemp', '-fasta_width', '0', '-fastq_qmax', '100', '-fulldp', '-usersort'
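To run the logged command by hand, the Python argv list from the error message can be joined into a single copy-pasteable shell line; a minimal stdlib sketch (the short list below is a stand-in for the full one printed above):

```python
# Hedged sketch: convert the argv list ipyrad logs into one shell command.
import shlex

# Stand-in for the full argument list from the error log above.
argv = [
    "/home/ubuntu/miniconda2/lib/python2.7/site-packages/bin/vsearch-linux-x86_64",
    "-cluster_smallmem", "PairedTrim_derep.fastq",
    "-strand", "both",
    "-id", "0.85",
    "-threads", "2",
]

# Quote each element so paths with special characters survive copy-paste.
cmd = " ".join(shlex.quote(a) for a in argv)
print(cmd)
```

Running the resulting line directly in a shell usually surfaces the underlying vsearch error (out of memory, truncated input file, etc.) that ipyrad's wrapper swallows.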
Isaac Overcast
@isaacovercast
Apr 11 2018 15:49
@Ollie_W_White_twitter You could try lowering the mindepth_majrule parameter to see if that recovers more reads, which would indicate that you really do just have lots of low-depth clusters. If you assembled R1 alone and it worked fine, introducing R2 could cause oversplitting if the R2 reads are very noisy. Did you inspect some of the files in FastQC?
Peter B. Pearman
@pbpearman
Apr 11 2018 16:17

I am trying to run ipyrad locally on Ubuntu (n=24) using Jupyter, in order to align paired GBS data. ipcluster seems to start the engines locally OK:

ipyrad_E_umbellatum2$ ipcluster start --engines Local --n 20
2018-04-11 16:52:59.204 [IPClusterStart] Starting ipcluster with [daemon=False]
2018-04-11 16:52:59.205 [IPClusterStart] Creating pid file: /home/bgppermp/.ipython/profile_default/pid/ipcluster.pid
2018-04-11 16:52:59.205 [IPClusterStart] Starting Controller with LocalControllerLauncher
2018-04-11 16:53:00.214 [IPClusterStart] Starting 20 Engines with Local
2018-04-11 16:53:32.268 [IPClusterStart] Engines appear to have started successfully

Then things get confusing. I try to connect to the cluster:

ipyclient=ipp.Client()

and receive this warning:

/home/bgppermp/anaconda2/lib/python2.7/site-packages/ipyparallel/client/client.py:458: RuntimeWarning:
Controller appears to be listening on localhost, but not on this machine.
If this is true, you should specify Client(...,sshserver='you@Tathagata')
or instruct your controller to listen on an external IP.
RuntimeWarning)

but the local engines seem to be recognized:

ipyclient.ids

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

When I start step 1,

data1.run("1",ipyclient=ipyclient, force=True)

after a minute and a half or so, the run seems to stop, as shown on the gnome-system-monitor:

I lose access to the mounted AFP file system on which the data are stored, and the symlink to the remote file system becomes inactive. No ipyrad error message is generated, and there are no memory capacity problems on the local machine or on the server. The file system is mounted via Nautilus and referenced via a symbolic link. Any ideas? Why would running ipyrad lead to the loss of the mounted file system? I also note that one processor seems to max out completely and stay that way.

Ollie White
@Ollie_W_White_twitter
Apr 11 2018 17:02
image.png
Hi @isaacovercast, yes, I will have a go at lowering the min depth requirement to see if that makes a difference. I just ran FastQC on samples with more and fewer reads (art_B82 and tan_C6), and they both look pretty good. The attached photo shows the reverse reads for art_B82 and tan_C6, respectively.
Isaac Overcast
@isaacovercast
Apr 11 2018 17:13
@pbpearman Have you tried using ipyrad in CLI mode? If you look at top, what is the process that's maxed out? So this is an Apple Filing Protocol (AFP) mount? If it's a remote mount, ipyrad might be doing something to antagonize the remote file server, maybe exceeding a bandwidth cap? Check the logs on the server to see why it's killing the link.
@Ollie_W_White_twitter Still, it couldn't hurt to try increasing the value of trim_reads on R2 to shave off more of the distal end. Trim it down to 100 bp, just to see if that helps. I'll bet it will.
Peter B. Pearman
@pbpearman
Apr 11 2018 17:26
@isaacovercast, the process is from gvfsd-afp. I guess that suggests a problem with the connection to the server; it could be a cause, or it could be a symptom. It is a 1 Gb/s line between the server and the local machine. I have not tried the CLI, but I can tomorrow and will get back to you.