These are chat archives for nextflow-io/nextflow

14th
Apr 2015
Paolo Di Tommaso
@pditommaso
Apr 14 2015 18:41
@andrewcstewart In the case you missed it http://www.nextflow.io/docs/latest/amazons3.html
Andrew Stewart
@andrewcstewart
Apr 14 2015 19:51
whoa!
I can only assume that is awesome. For some reason I can't get to the website again.
Must have something to do with how we're routing to S3
is the documentation also on github (even if in raw form or something)?
and yes it is.
btw this might be of interest to some: http://aws.amazon.com/efs/
Paolo Di Tommaso
@pditommaso
Apr 14 2015 19:57
oh, still that problem.
that's so strange, I'm wondering if it could be something on my side
anyway, you can also find it here
Andrew Stewart
@andrewcstewart
Apr 14 2015 19:59
I don't think so. I confirmed with my network guy that all S3-addressed traffic is routed through our DirectConnect connection
and I'm guessing there's a firewall rule in the way
since it's not designed by default for HTTP traffic
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:01
can you open this ?
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:01
btw I'm considering trying to run Nextflow as an ECS service (Elastic Container Service, i.e. native AWS Docker deployment)
nope
can't open www.nextflow.io either
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:02
well, if so it can't be a DNS problem
there must be something messed up in your firewall rules
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:02
yup
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:03
Yes, I read about EFS. Finally!
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:03
right? only took forever.
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:03
what took forever?
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:04
to develop EFS
they leave a lot of things to the community to figure out.. like NFS
hence things like StarCluster exist
which is fine, but it's great when it graduates to a full native service
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:05
yes, they announced it for the end of the year, didn't they?
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:05
what I'm picturing is running Nextflow as an ECS service that executes its tasks in other ECS containers, with all the containers mounting EFS
yes I believe so
though you can sign up for early access now
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:06
I've already signed up :)
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:06
you just get placed in a queue?
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:07
ECS sounds interesting, though I'm a little skeptical that it's a good pattern
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:07
how come?
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:07
well, I've just filled out the form. they said they'll let me try it soon ..
with ECS you have to provide the configuration as a JSON file
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:09
to configure the container
tell ECS where to get the image, etc
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:09
maybe, it could be a way
anyway, I've managed to integrate the Cirrus cluster with nextflow and I'm pretty happy with it
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:10
I think you just install Nextflow inside a Docker image.. pass in any ENVs through the configuration... pull a pipeline from S3 or from github/bitbucket..
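A rough sketch of that kind of launch, assuming Nextflow's GitHub integration and Docker support; the repository name and container image below are placeholders, not anything from the chat:

    on the command line (pulls the pipeline from GitHub and runs tasks in Docker):

        nextflow run my-org/my-pipeline -with-docker

    in the pipeline's nextflow.config (pins the container image):

        process.container = 'my-org/my-tools:latest'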
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:11
yep, but the scheduler is the missing piece when using ECS
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:12
I knew I forgot about something :D
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:12
:)
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:12
I just started exploring ECS today
you could actually use ECS to kick off cluster deployment first
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:15
if you need to run jobs in the cloud on spot instances, I'd suggest giving ClusterK a try
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:15
is clusterk different from cirrus?
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:16
cirrus is the name of the scheduler they developed, ClusterK is the name of the company
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:16
gotcha
I could never find good documentation when searching for ClusterK
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:17
yes, they haven't published it
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:17
cirrus is doing the trick though
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:18
yes, it does the trick, even though it's a bit outdated
however they provide command line tools very similar to qsub, qstat, etc.
that submit jobs to a cloud cluster made up of EC2 spot instances
and you can use S3 as shared storage
(for this reason I've added support for it in nextflow)
anyway, I had a look at the docs at this link https://cirruscluster.readthedocs.org/en/latest/#
and they're very old
it's completely different now
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:23
is there more up-to-date documentation somewhere?
or are you a commercial client?
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:24
currently I have an evaluation license
to access the doc you need an account on the platform
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:29
gotcha
definitely interesting
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:31
and nextflow works like a charm with it :)
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:34
question on the S3 channel handlers
what's going on behind the scenes?
or rather, what happens if I point to a huge file on S3?
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:35
files are copied from S3 to the node for computation, and the results are copied back to S3
but in the end the same thing happens with NFS
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:39
so it will copy the files from S3 into the current nextflow process's workdir ?
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:39
yes
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:41
gotcha
what if those are huge fastq files and I don't want them in my workdir?
what would be the best way to keep them outside of the workdir?
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:42
um, the process workdir ?
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:42
yeah
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:43
well, if you need to process them you will need to copy them.
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:43
I suppose I could delete them in the process once I'm done processing them
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:44
even when you read them over NFS you are actually transferring them from the storage to the local node
ah
I get your point
up to now they stay there, but a workdir clean-up directive could easily be added to nextflow
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:45
btw.. my thinking here is that these fill up space fast, and there's not really a point in keeping them around after alignment... or at least putting them on a different drive.
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:46
it's a matter of rm -rf workdir on job completion
:)
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:46
wait.. at what point (and thus what location) are they downloaded?
like let's say I create a channel to input S3 paths
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:47
yep
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:47
and that's the input to my process, which performs an alignment producing a BAM file?
obviously I don't want to remove the BAM file though
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:48
you can remove it from the process workdir because it will be copied to the pipeline workdir
on S3
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:48
I suppose the logic should be: if (bam file == good): rm fastq
?
I don't follow what you just said
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:49
you can discard the whole workdir on the EC2 node because the outputs are copied to the S3 storage used by nextflow to track your execution
is it clear?
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:50
ohhhhhhhhhh
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:50
gotcha!
:)
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:50
this totally changes my thinking
so are they -necessarily- copied?
or is that just an option?
do you have any example code actually?
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:51
-necessarily
I run this for example
but it's the usual code, it does not need any change
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:53
so you're saying that basically nextflow just reads and writes all files to S3
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:53
yep
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:53
nothing hits the local fs?
I mean besides maybe intermediates ?
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:54
let's recapitulate slowly:
when a process needs to be executed, nextflow copies all the files you declared as input from S3 to the local fs
the job is executed on the local fs, so all intermediate files are stored there
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:56
right
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:56
when it completes, nextflow copies only the files you declared in the output section back to S3
and so on ..
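A hedged sketch of the staging behaviour just described, with made-up bucket paths and tools: the declared inputs come down from S3 before the script runs, the intermediate SAM stays on the node's local fs, and only the declared BAM is copied back to the S3 work directory.

    params.genome = 's3://my-bucket/ref/genome.fa'             // hypothetical reference on S3
    genome = file(params.genome)
    reads  = Channel.fromPath('s3://my-bucket/reads/*.fq.gz')  // hypothetical fastq files

    process align {
        input:
        file ref from genome         // staged from S3 to the local fs
        file fq  from reads          // staged from S3 to the local fs

        output:
        file 'sample.bam' into bams  // only declared outputs are copied back to S3

        """
        bwa index $ref
        bwa mem $ref $fq > sample.sam              # intermediate file, stays on the node
        samtools view -b sample.sam > sample.bam
        """
    }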
Andrew Stewart
@andrewcstewart
Apr 14 2015 20:57
and is that still in a workdir created by nextflow or does this break that paradigm?
ah gotcha
Paolo Di Tommaso
@pditommaso
Apr 14 2015 20:57
exactly the same mechanism
it's easier to try it than to explain it :)
Andrew Stewart
@andrewcstewart
Apr 14 2015 21:01
so this is just handled as a process directive then?
Paolo Di Tommaso
@pditommaso
Apr 14 2015 21:02
yes, you only need to define:
 process.executor = 'cirrus'
 process.queue = 'queue-name' 
Andrew Stewart
@andrewcstewart
Apr 14 2015 21:03
to use S3?
Paolo Di Tommaso
@pditommaso
Apr 14 2015 21:03
-w s3://bucket/path on the command line
pretty neat, isn't it?
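Putting the two pieces together, a minimal hedged example with a placeholder queue name and bucket:

    in nextflow.config:

        process.executor = 'cirrus'
        process.queue    = 'demo-queue'

    on the command line (keeps the pipeline work directory on S3):

        nextflow run main.nf -w s3://my-bucket/work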
Andrew Stewart
@andrewcstewart
Apr 14 2015 21:04
do I need to use the cirrus executor though in order to use S3?
(very neat)
Paolo Di Tommaso
@pditommaso
Apr 14 2015 21:04
no
S3 is just another file system that you can use from any nextflow script
Andrew Stewart
@andrewcstewart
Apr 14 2015 21:05
Ok, I was just confused about why you dropped the cirrus reference
so yeah, that's a pretty slick implementation of S3 channels
Paolo Di Tommaso
@pditommaso
Apr 14 2015 21:06
you simply need to create a handle to an S3 file like this:
my_file = file('s3://bucket/some/file/name')
then you can read and write as usual
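For instance, with a hypothetical bucket, reads and writes work the same way as for a local path:

    my_file = file('s3://my-bucket/some/file.txt')           // handle to an S3 object
    println my_file.text                                     // read the whole object as text
    my_file.copyTo('/tmp/file.txt')                          // download a local copy
    file('s3://my-bucket/other/copy.txt').text = 'hello'     // create/overwrite an object on S3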
Andrew Stewart
@andrewcstewart
Apr 14 2015 21:07
or you can do -w to apply to all?
or that's separate
I see.. the two work together
Paolo Di Tommaso
@pditommaso
Apr 14 2015 21:07
it's separate
Andrew Stewart
@andrewcstewart
Apr 14 2015 21:08
bonus feature
Paolo Di Tommaso
@pditommaso
Apr 14 2015 21:09
what?
Andrew Stewart
@andrewcstewart
Apr 14 2015 21:10
the s3 working directory
Paolo Di Tommaso
@pditommaso
Apr 14 2015 21:10
ah, yes
well, it's starting to get late around here
Andrew Stewart
@andrewcstewart
Apr 14 2015 21:10
so can I, as of now, clean up the process directory after the outputs have been sent back to S3?
sure
thanks for all the great news!
Im stoked to try this
Paolo Di Tommaso
@pditommaso
Apr 14 2015 21:11
no, for now the clean-up has to be managed manually
but I will add it asap
Andrew Stewart
@andrewcstewart
Apr 14 2015 21:11
ok so I can just throw an rm in there or something
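Something along these lines, as a sketch of that manual clean-up with placeholder names; the rm runs at the end of the task script, once the declared output has been produced:

    process count_reads {
        input:
        file fq from Channel.fromPath('s3://my-bucket/reads/*.fq.gz')  // hypothetical bucket

        output:
        file 'count.txt' into counts   // copied back to the pipeline work directory

        """
        zcat $fq | awk 'END { print NR/4 }' > count.txt
        rm -f $fq     # free the space taken by the staged fastq once we're done with it
        """
    }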
Paolo Di Tommaso
@pditommaso
Apr 14 2015 21:11
it's a nice idea when working with ec2 instances
Andrew Stewart
@andrewcstewart
Apr 14 2015 21:12
I'm also thinking about the analog of Channel.fromPath("s3://")
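A hedged sketch of what that would look like, with a made-up bucket and glob pattern:

    fastqs = Channel.fromPath('s3://my-bucket/runs/**/*.fastq.gz')  // hypothetical glob
    fastqs.subscribe { println it }                                 // one S3 path per matching object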
Paolo Di Tommaso
@pditommaso
Apr 14 2015 21:12
of course
Andrew Stewart
@andrewcstewart
Apr 14 2015 21:12
filtering based on metadata tag:value could be really handy
ok I'll let you end your day
Paolo Di Tommaso
@pditommaso
Apr 14 2015 21:13
ah, S3 files have tag metadata?
Andrew Stewart
@andrewcstewart
Apr 14 2015 21:13
yea
Paolo Di Tommaso
@pditommaso
Apr 14 2015 21:13
I was missing this
Andrew Stewart
@andrewcstewart
Apr 14 2015 21:13
need to write them at creation time though
Paolo Di Tommaso
@pditommaso
Apr 14 2015 21:13
that would be very useful
I will investigate that
Andrew Stewart
@andrewcstewart
Apr 14 2015 21:13
you can add them after the fact, but that effectively rm+cp's the object again. fine for small files, but for big files it's a pain.
Paolo Di Tommaso
@pditommaso
Apr 14 2015 21:14
I see, it looks interesting
have a nice day
Andrew Stewart
@andrewcstewart
Apr 14 2015 21:14
you too!