These are chat archives for dereneaton/ipyrad

28th May 2018
Francisco Pina-Martins
@StuntsPT
May 28 2018 10:31
Hi, I'm having this strange issue:
francisco@Loki [11:15:27] [~/GBS/qsuber_reference_01]
-> $ ipyrad -p params-reference1.txt -s 1234567 -c 6 -d

  ** Enabling debug mode **

 -------------------------------------------------------------
  ipyrad [v.0.7.24]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------
  New Assembly: reference1
  establishing parallel connection:
  host compute node: [6 cores] on Loki

  Step 1: Demultiplexing fastq data to Samples
  [####################] 100%  chunking large files  | 0:14:16
  [                    ]   0%  sorting reads         |
Step 1 just stays on "sorting reads" forever without using any CPU.
It never goes over 0%.
I'm currently running with -d.
Here is ipyrad.log:
-------------------------------------------------------------
  ipyrad [v.0.7.24]
  Interactive assembly and analysis of RAD-seq data
 -------------------------------------------------------------
  Begin run: 2018-05-28 11:15
  Using args {'preview': False, 'force': False, 'threads': 2, 'results': False, 'quiet': False, 'merge': None, 'ipcluster': None, 'cores': 6, 'params': 'params-reference1.txt', 'branch': None, 'steps': '1234567', 'debug': True, 'new': None, 'download': None, 'MPI': False}
  Platform info: ('Linux', 'Loki', '4.4.0-109-generic', '#132-Ubuntu SMP Tue Jan 9 19:52:39 UTC 2018', 'x86_64')
2018-05-28 11:15:30,633         pid=2132        [assembly.py]   WARNING         Some names from population input do not match Sample names: 
2018-05-28 11:15:30,633         pid=2132        [assembly.py]   WARNING         If this is a new assembly this is normal.
2018-05-28 11:15:30,634         pid=2132        [parallel.py]   INFO    ['ipcluster', 'start', '--daemonize', '--cluster-id=ipyrad-cli-2132', '--engines=Local', '--profile=default', '--n=6']
2018-05-28 11:15:38,869         pid=2132        [demultiplex.py]        INFO    zcat is using optim = 8000000
All of these messages were printed during the "chunking large files" part,
so I guess ipyrad is just stopping silently.
Oh, here is my parameters file, in case that matters:
------- ipyrad params file (v.0.7.24)-------------------------------------------
reference1                     ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
./                             ## [1] [project_dir]: Project dir (made in curdir if not present)
../Qsuber.fastq.tar.gz         ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
../Qsuber.barcodes            ## [3] [barcodes_path]: Location of barcodes file
                               ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
reference                      ## [5] [assembly_method]: Assembly method (denovo, reference, denovo+reference, denovo-reference)
/home/francisco/DataSets/Q.suber/Qsuber_draft-1.0.fsa_nt.gz                               ## [6] [reference_sequence]: Location of reference sequence file
gbs                            ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
TGCAT, TGCAT                   ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
5                              ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33                             ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
8                              ## [11] [mindepth_statistical]: Min depth for statistical base calling
8                              ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000                          ## [13] [maxdepth]: Max cluster depth within samples
0.85                           ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0                              ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
2                              ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35                             ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2                              ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
5                              ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus (R1, R2)
8                              ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus (R1, R2)
4                              ## [21] [min_samples_locus]: Min # samples per locus for output
20                             ## [22] [max_SNPs_locus]: Max # SNPs per locus (R1, R2)
8                              ## [23] [max_Indels_locus]: Max # of indels per locus (R1, R2)
0.8                            ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus (R1, R2)
0, 0, 0, 0                     ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0                     ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
l, p, s, v                     ## [27] [output_formats]: Output formats (see docs)
../Qsuber.popfile             ## [28] [pop_assign_file]: Path to population assignment file
Deren Eaton
@dereneaton
May 28 2018 13:19
Hi @StuntsPT, it looks like you are passing in a tar file for the raw data path. You'll need to untar the folder. Your fastq files can be compressed but not the directory structure.
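A minimal sketch of that fix in Python, assuming the archive really is ../Qsuber.fastq.tar.gz as in the params file above (the output directory name raw_fastqs is only an example):

import tarfile

# extract the tarball so raw_fastq_path can point at the individual
# (optionally gzipped) fastq files instead of the archive itself
with tarfile.open("../Qsuber.fastq.tar.gz", "r:gz") as tar:
    tar.extractall(path="../raw_fastqs")
# parameter [2] (raw_fastq_path) can then be set to something like ../raw_fastqs/*.fastq.gz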
Francisco Pina-Martins
@StuntsPT
May 28 2018 13:20
Thanks @dereneaton!
I'm untarring the file and will be gzipping it again, so in half an hour or so I should be able to report back.
Deren Eaton
@dereneaton
May 28 2018 13:33
You will also need to unzip your genome file so it is fasta.
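A minimal sketch of decompressing the reference with Python's standard library, assuming the path shown for parameter [6] in the params file above:

import gzip
import shutil

# write an uncompressed copy of the gzipped reference next to the original
src = "/home/francisco/DataSets/Q.suber/Qsuber_draft-1.0.fsa_nt.gz"
with gzip.open(src, "rb") as fin, open(src[:-3], "wb") as fout:
    shutil.copyfileobj(fin, fout)
# parameter [6] (reference_sequence) then points at the plain .fsa_nt fasta file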
Francisco Pina-Martins
@StuntsPT
May 28 2018 14:06
OK, so the genome needs to be unzipped too.
gzip is taking a while, though, to "recompress" the raw data.
Francisco Pina-Martins
@StuntsPT
May 28 2018 14:56
Hmm... it must be more than this.
The result is currently the same, using an ungzipped "genome" and "tarless" raw data:
"sorting reads" has once again been stuck at 0% for the last 45 minutes,
and there is still no CPU activity in htop.
Isaac Overcast
@isaacovercast
May 28 2018 16:32
@zapataf Can you post the relevant lines from the ipyrad_log.txt file? If you run in debug mode it should leave temporary files in place, so you should be able to look in your *_clust directory and check whether the file it is looking for is actually there. Also, can you post the exact error message?
@StuntsPT Can you post your params file again?
Francisco Pina-Martins
@StuntsPT
May 28 2018 16:54
I have just worked around the problem by using previously demultiplexed files.
Give me a sec to get you my params file.
------- ipyrad params file (v.0.7.24)-------------------------------------------
reference1                     ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
./                             ## [1] [project_dir]: Project dir (made in curdir if not present)
../Qsuber.fastq.gz             ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
../Qsuber.barcodes             ## [3] [barcodes_path]: Location of barcodes file
                               ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
reference                      ## [5] [assembly_method]: Assembly method (denovo, reference, denovo+reference, denovo-reference)
/home/francisco/DataSets/Q.suber/Qsuber_draft-1.0.fsa_nt                               ## [6] [reference_sequence]: Location of reference sequence file
gbs                            ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
TGCAT, TGCAT                   ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
5                              ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33                             ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
8                              ## [11] [mindepth_statistical]: Min depth for statistical base calling
8                              ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000                          ## [13] [maxdepth]: Max cluster depth within samples
0.85                           ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0                              ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
2                              ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35                             ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2                              ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
5                              ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus (R1, R2)
8                              ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus (R1, R2)
4                              ## [21] [min_samples_locus]: Min # samples per locus for output
20                             ## [22] [max_SNPs_locus]: Max # SNPs per locus (R1, R2)
8                              ## [23] [max_Indels_locus]: Max # of indels per locus (R1, R2)
0.8                            ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus (R1, R2)
0, 0, 0, 0                     ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0                     ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
l, p, s, v                     ## [27] [output_formats]: Output formats (see docs)
../Qsuber.popfile              ## [28] [pop_assign_file]: Path to population assignment file
Katherine Silliman
@ksil91
May 28 2018 21:53

Reposting :) If you can roughly point me to the code on GitHub that describes how S7 stats are stored now, I can probably read through it and figure out how to access the stats through the API. This was one of my favorite features of the API, as it also allowed some great plotting of S7 stats, so I hope it's still in there somewhere.

@isaacovercast @dereneaton I have been using the API in notebooks for filtering low coverage individuals. After updating to 0.7.21, it seems that the Assembly object changed a little in how it stores S7 stats. This code doesn't work anymore:

## get list of samples found in at least 40% of loci after Step 7
skeep = s3filt.stats.index[s3filt.stats_dfs.s7_samples.sample_coverage > loci40].tolist()

It gives this error:

AttributeError: No such attribute: s7_samples

Is there another way to filter samples by their S7 stats properties?

Isaac Overcast
@isaacovercast
May 28 2018 23:01
@ksil91 On line 260 of ipyrad/assemble/write_outfiles.py you'll see the creation of the 's7_samples' stats. I don't believe this has changed recently.
Looking back in the history of the file, this part of step 7 hasn't changed in about 2 years, so it's unlikely to be the problem. Did you already run step 7 on this assembly? Sorry, this is a silly question, but it's the simplest explanation for why 's7_samples' wouldn't exist in the stats dictionary.
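A minimal sketch of that check through the API, assuming an assembly saved as "s3filt.json" and a purely hypothetical loci threshold; the attribute name s7_samples and the sample_coverage column are taken from the discussion above:

import ipyrad as ip

data = ip.load_json("s3filt.json")   # hypothetical assembly JSON
loci40 = 1000                        # hypothetical coverage threshold

# stats_dfs only gains an s7_samples table after step 7 has been run
if hasattr(data.stats_dfs, "s7_samples"):
    s7 = data.stats_dfs.s7_samples
    skeep = s7.index[s7.sample_coverage > loci40].tolist()
else:
    print("step 7 has not been run on this assembly yet")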
Isaac Overcast
@isaacovercast
May 28 2018 23:08
@StuntsPT This is what I suspected. The raw_fastq_path must be a directory: the fastq files inside it can be gzipped or ungzipped, but the parameter has to point at a directory that contains them. I think there was a little confusion about what Deren had suggested. Is Qsuber.fastq.gz the raw data? If so, you can 'mkdir raws', then 'mv Qsuber.fastq.gz raws/Qsuber_R1_.fastq.gz', and then use './raws/*.fastq.gz' for this parameter.
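For reference, a sketch of the same directory setup using Python's standard library, equivalent to the mkdir/mv commands Isaac suggests (the directory name raws is his example):

from pathlib import Path

# make a directory for the raw data and move the single fastq.gz into it
raws = Path("raws")
raws.mkdir(exist_ok=True)
Path("Qsuber.fastq.gz").rename(raws / "Qsuber_R1_.fastq.gz")
# then set raw_fastq_path ([2]) to ./raws/*.fastq.gz in the params file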