These are chat archives for dereneaton/ipyrad

30th Jan 2016
Isaac Overcast
@isaacovercast
Jan 30 2016 01:59
Is there a good reason to gzip output files from step1? If we're gunzipping them in step 2 it's just killing our performance.
Deren Eaton
@dereneaton
Jan 30 2016 02:01
Just being nice to people's disk space. I suppose we could skip it, though; it makes our performance look more impressive.
Isaac Overcast
@isaacovercast
Jan 30 2016 02:05
I don't think disk is a limiting factor. Step 1 on this beefy box I got access to just killed the demultiplexing. Massive parallelization was crushing the data, but gzipping the output fastq files is killing us because it's serial.
I guess step 1 isn't really the bottleneck in the pipeline. If it takes 1 hour or 3 hours that's not a big whoop. I'll be curious to see how much time we can pick up on step 3.
Deren Eaton
@dereneaton
Jan 30 2016 02:08
yeah, totally. We can just note in the docs that users can feel free to gzip the files themselves if they want to, and that step2 can read gzipped files.
yeah, step3 is where I'm curious too, whether vsearch can thread across multiple nodes.
Isaac Overcast
@isaacovercast
Jan 30 2016 02:13
For long term, do you think we should get stuff in readthedocs? Is that your vision of where shit would live?
by shit i mean docs of course
Isaac Overcast
@isaacovercast
Jan 30 2016 02:18
The actual demultiplexing on a 40-core box took literally 20 minutes, it slayed it. Biggest bottlenecks in step 1 are counting lines in the raw data to optimize chunk size and gzipping output fq files.
Isaac Overcast
@isaacovercast
Jan 30 2016 02:38
Dude! This beefy box just destroyed the data. Step 1 took < 3 hrs on a full plate. We could get that down to <2 with some simple optimizations.
Deren Eaton
@dereneaton
Jan 30 2016 06:22
yeah, we should work towards having docs on rtd. Awesome, 2 hrs ain't bad. What kind of optimizations do you have in mind?
Isaac Overcast
@isaacovercast
Jan 30 2016 15:04
bash-3.2$ gunzip -c D25GWACXX_6_fastq.gz | head -n 4000 | gzip > wat.gz
bash-3.2$ ls -l
total 53098224
-rw-r--r--  1 glenn  staff  27186207248 Jan 25 11:41 D25GWACXX_6_fastq.gz
-rw-r--r--  1 glenn  staff        81737 Jan 30 10:02 wat.gz
bash-3.2$ gunzip -c D25GWACXX_6_fastq.gz | head -n 40000 | gzip > wat.gz
bash-3.2$ ls -l
total 53099616
-rw-r--r--  1 glenn  staff  27186207248 Jan 25 11:41 D25GWACXX_6_fastq.gz
-rw-r--r--  1 glenn  staff       791602 Jan 30 10:03 wat.gz
bash-3.2$ gunzip -c D25GWACXX_6_fastq.gz | head -n 400000 | gzip > wat.gz
bash-3.2$ ls -l
total 53113544
-rw-r--r--  1 glenn  staff  27186207248 Jan 25 11:41 D25GWACXX_6_fastq.gz
-rw-r--r--  1 glenn  staff      7925207 Jan 30 10:04 wat.gz
It's almost exactly 80 bytes per read quartet in the raw gz
Isaac Overcast
@isaacovercast
Jan 30 2016 15:15
If we round up to 100 bytes per read, I estimate there are 271 million reads in the raw data; in reality there are 297 million, so the estimate is within 10%.
Close enough for government work. Just an idea. Got any feelings one way or the other about fuzzy estimation of optim in step1() rather than exact?
Deren Eaton
@dereneaton
Jan 30 2016 17:53
Sounds great. Nice.