These are chat archives for nextflow-io/nextflow

25th Apr 2017
Paolo Di Tommaso
@pditommaso
Apr 25 2017 08:45
@mes5k sorry Mike, I'm not understanding the problem. Please open an issue with a test that I can replicate.
Mike Smoot
@mes5k
Apr 25 2017 15:59
Once I can create a test case, I'll definitely submit a ticket. It's very strange to me that the same code in one pipeline produces one result, while in a different pipeline it does something very different. I'd like to figure out what is different about seemingly identical code. I'll keep you posted if I find anything.
Michael L Heuer
@heuermh
Apr 25 2017 20:13
I'd like to create a channel from a glob over an s3 bucket. Is there enough s3 support already baked in, or should I use an s3 client in a process?
Paolo Di Tommaso
@pditommaso
Apr 25 2017 20:15
it's natively supported, just prefix the file paths with s3://
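e.g. something like this (bucket name and glob are just placeholders):

Channel.fromPath('s3://my-bucket/data/*.bam')
       .println()

that gives you a channel of Path objects backed by the S3 file system provider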
Michael L Heuer
@heuermh
Apr 25 2017 20:18
nice! Is there an isS3 method on the files, or some other way to check if a path is s3://? we'd want to use https://github.com/BD2KGenomics/conductor instead of the s3 support built into Nextflow for downloading and uploading, and fall back to normal file access if the paths aren't s3
Paolo Di Tommaso
@pditommaso
Apr 25 2017 20:22
um, you can rely on the Java Path API
something like
if( your_file.fileSystem.provider().scheme == 's3' ) {
    // S3-specific handling here
}
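or wrap it in a tiny helper (the isS3 name is just for illustration):

def isS3( path ) {
    // true when the path's file system provider uses the s3 scheme
    path.fileSystem.provider().scheme == 's3'
}

then you can do if( isS3(my_file) ) { .. } else { .. } to pick conductor or plain file access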
Michael L Heuer
@heuermh
Apr 25 2017 20:27
yeah, that would work! thanks
Paolo Di Tommaso
@pditommaso
Apr 25 2017 20:27
:+1:
Michael L Heuer
@heuermh
Apr 25 2017 20:30

one last thing, while I have yer attention ;)
I have some steps that read from HDFS and write to HDFS, so there aren't any local file inputs or outputs. I'm thinking of just passing through a tuple, e.g.

input:
  set sample, ... from downloaded
output:
  set sample, ... into transformed

Sorry, that isn't a question. Carry on. :)
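a fuller sketch of one of those pass-through steps might look like this (the process name, the transform.py job and the hdfs paths are all made up):

process transform {
    input:
    set sample, hdfs_in from downloaded

    output:
    set val(sample), val(hdfs_out) into transformed

    script:
    // compute the HDFS output location for this sample
    hdfs_out = "hdfs:///data/${sample}.transformed"
    """
    spark-submit --master yarn transform.py ${hdfs_in} ${hdfs_out}
    """
}

since everything is a val, nothing gets staged locally; the tuple just carries the sample id and the HDFS locations along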

Paolo Di Tommaso
@pditommaso
Apr 25 2017 20:30
there's always a lot of attention here .. :)
but are those files stored on HDFS?
Michael L Heuer
@heuermh
Apr 25 2017 20:40
After spinning up an Apache Spark + HDFS cluster, the workflow in Nextflow would be:
1) glob an s3 bucket for, say, *.bam files
2) download from s3 to HDFS using conductor
3) run various Spark jobs reading from and writing to HDFS
4) upload from HDFS to the s3 bucket using conductor
5) report
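as a rough skeleton (bucket name, paths, and the conductor-copy command are all placeholders for the real invocations):

// 1) glob the bucket
Channel.fromPath('s3://my-bucket/*.bam')
       .map { f -> [ f.baseName, f.toUriString() ] }  // keep the full s3:// URI as a string
       .set { samples }

// 2) s3 -> HDFS per sample
process download {
    input:
    set sample, s3_uri from samples

    output:
    set val(sample), val(hdfs_path) into downloaded

    script:
    hdfs_path = "hdfs:///data/${sample}.bam"
    """
    conductor-copy ${s3_uri} ${hdfs_path}
    """
}

// 3-5) the Spark, upload and report steps chain the same
//      (sample, path) tuples from channel to channel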
Paolo Di Tommaso
@pditommaso
Apr 25 2017 20:41
where is NF sitting here?
is it managing this workflow?
Michael L Heuer
@heuermh
Apr 25 2017 20:42
running on the Spark master node
Paolo Di Tommaso
@pditommaso
Apr 25 2017 20:43
to orchestrate the steps described above?
Michael L Heuer
@heuermh
Apr 25 2017 20:43
so it's not doing much in the way of distributed job management, it's only there to coordinate restarts and such per sample
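the restart side should mostly come for free with -resume (main.nf standing in for the actual script):

nextflow run main.nf -resume

already-completed tasks are pulled from the cache and only the missing ones re-run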
Paolo Di Tommaso
@pditommaso
Apr 25 2017 20:44
I see, nice
Michael L Heuer
@heuermh
Apr 25 2017 20:44
we'll see if it works :) beats bash scripts for sure
Paolo Di Tommaso
@pditommaso
Apr 25 2017 20:44
sure! :)
are you planning to run this in a cloud cluster?
Michael L Heuer
@heuermh
Apr 25 2017 20:47
Yep, we spin them up on AWS using https://github.com/BD2KGenomics/cgcloud
Paolo Di Tommaso
@pditommaso
Apr 25 2017 20:48
cool