These are chat archives for nextflow-io/nextflow

8th
Jan 2015
Paolo Di Tommaso
@pditommaso
Jan 08 2015 12:51
@andrewcstewart Unfortunately not yet.
However, what happens is that every task is assigned a unique ID,
and that unique ID is used to create a temporary folder in the work directory.
Then all the files you have declared in the input block are staged in the task's temporary folder by creating a symlink to the original file.
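for example, a task staging area ends up looking something like this (hash and file names illustrative; the .command.* files are the ones Nextflow generates):

    work/
      4f/2a9c08.../                            # per-task folder named after the unique ID
        .command.sh                            # the task script
        .command.run                           # the wrapper that executes it
        input.fa -> /data/original/input.fa    # staged input, symlinked to the original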
Andrew Stewart
@andrewcstewart
Jan 08 2015 20:53
Thanks @pditommaso. All the .command* files in the work space are also really informative
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:01
I have a question, and am happy to take it to the Google group in a moment, but I'll ask here first anyway
I have a very large directory of reference data (a reference genome) that I'd like to make available to multiple processes in my pipeline
I'm trying to understand the best way to approach this
(I'm also using Docker + SGE, so some considerations there may be clouding how I think about this)
What I'm not quite understanding is whether this warrants defining a channel...
or if I can just pass a single string (a path to the directory) as a parameter to Nextflow and then expose that path (via whatever symlink magic Nextflow uses) to any given process
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:05
hi
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:06
Hello!
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:06
A process input does not necessarily have to be a regular file
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:06
I wanted to draft my question there because I'm not quite sure I'm framing it properly before posting on google groups
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:06
it can also be a directory
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:06
via file?
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:07
yes
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:07
i.e. ...
input:
file reference from something
1 sec I'm going to create a gist
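something along these lines (a minimal sketch, names hypothetical):

    reference = file(params.reference)   // params.reference may point to a directory

    process align {
        input:
        file reference                   // the whole directory is staged into the task folder

        """
        ls ${reference}
        """
    }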
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:08
yes perfect
let's use the Google group
so it may also benefit other people
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:10
ok
is this the first time this type of question has come up?
don't want to reinvent the wheel
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:11
ah ok, I was thinking it was more complex
:)
yes, it's just that
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:12
is that going to copy data or just symlink?
because that directory is huge
50GB or so
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:12
no, only the symlink
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:13
awesome
that was really easy
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:13
yep
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:13
so my only real confusion is with the word 'file' for the process input type
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:13
yes I know
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:14
so I won't bother with the google group post
unless you'd like me to :)
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:15
no don't worry
it seemed more complex :)
anyway I've updated the doc about the file function recently
You may want to give it a look
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:17
awesome, thanks a lot
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:18
welcome
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:18
that actually reminds me of one other question
in that example
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:18
good
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:18
It seems like we're doing file() twice
line 4: reference = file(params.reference)
line 10: file reference
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:19
the same word, but two really different semantics
at line 4, file is a function to transform the string to a file object (actually a java Path http://docs.oracle.com/javase/7/docs/api/java/nio/file/Path.html)
at line 10, file means that the value the process receives from the channel you have specified has to be managed as a file
i.e. it will stage it in the task temp directory, creating the symlink, etc.
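a minimal sketch of the two usages side by side (tool and names hypothetical):

    reference = file(params.reference)   // sense 1: file() turns the string into a Path object

    process index {
        input:
        file reference                   // sense 2: the 'file' qualifier stages the received
                                         // value as a file (symlink) in the task work directory
        """
        your_tool --ref ${reference}
        """
    }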
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:24
So, you may wonder why you need to declare it as a file, when it is already a file object?
because the channel can also emit value types and stage them automatically as files
for example
look at this
at line 27, it creates a channel seq that emits fasta sequences as string values
but they are then staged as files in the process blast at line 31
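a sketch of that pattern (sequences shortened, names hypothetical):

    seq = Channel.from('>s1\nACGT', '>s2\nTTGA')   // emits plain string values, not files

    process blast {
        input:
        file query from seq            // each string is saved to a temp file and staged

        """
        cat ${query}
        """
    }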
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:28
so btw...
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:28
yes . . (?)
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:28
DB in that example (and 'reference' in mine)... aren't actually channels correct?
yet they are still staged
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:29
ah, that is a potential bug
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:30
well my line #4 doesn't define a channel
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:30
ah wait
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:30
yet at my line 10 I stage reference as though it were
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:30
yours is ok
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:31
I guess what I'm not understanding in that example is why reference isn't a channel but I can still stage it in a process.
unless the conclusion is that not all files staged in processes need come from channels
(which would make perfect sense for processes that use a single reference file for multiple input files coming out of a channel.. as in blast, as in genome assemblers, etc etc)
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:33
the point is that you use reference in the input declaration
in my case DB is not declared in the input block, so it is wrong
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:34
but assuming you added DB to the input block it would be okay
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:34
yep
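e.g. a corrected sketch (names hypothetical):

    db = file(params.db)

    process blast {
        input:
        file query from seq
        file db                        // DB declared in the input block, so it is staged too

        """
        blastp -query ${query} -db ${db}
        """
    }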
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:34
gotcha
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:34
also the input declaration is:
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:34
ok I understand much better now
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:35
input:
<input qualifier> <input name> [from <source channel>] [attributes]
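for example, `file query from seq` reads as: qualifier file, input name query, source channel seq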
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:36
so theoretically a better name for 'file' (the input qualifier) might be 'link' ?
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:36
uh
this opens another chapter: files are staged as symlinks as long as you are using a local or shared file system
but if you run with a cloud provider, the file will actually be copied.
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:38
ha
ok so maybe 'stage' would be more appropriate
(I'm not necessarily suggesting it be changed, I'm just asking for my own edification)
or even 'path'
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:40
yes, those are better options
I will put your suggestions in the Nextflow wishlist .. ;)
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:40
haha
I really really really like nextflow btw
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:40
thanks, that's great
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:41
one last question?
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:41
the best way to contribute to the project is to spread the word about it
sure
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:41
can I make the following shortcut?
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:42
?
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:42
something = file(params.in)
or do I need to explicitly define params.* with a default value?
(I suppose I could trial+error this)
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:42
ahh
yes, you can
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:43
so params don't really need to be declared
it seems
maybe they need to at least be mentioned or an error is thrown
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:44
it's not mandatory, but I find it good practice, because that way, if you don't specify a value on the command line, you have a default
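for example (a minimal sketch; the path is hypothetical):

    params.in = "$baseDir/data/sample.fa"   // default value; --in on the command line overrides it
    something = file(params.in)

then `nextflow run main.nf --in /other/data.fa` replaces the default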
Andrew Stewart
@andrewcstewart
Jan 08 2015 21:44
sure
that also seems to be the most natural entry point for unit testing at this point as well
so that a default pipeline can essentially be a test
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:45
yes, we are using it exactly in that way
for example
we test them automatically with Circle CI
with the default parameters pointing to a small reference dataset included in the pipeline
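for example, a hypothetical circle.yml along these lines (assumes Java on the build machine and test data shipped with the repo):

    test:
      override:
        - curl -fsSL get.nextflow.io | bash
        - ./nextflow run main.nf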
Michael L Heuer
@heuermh
Jan 08 2015 21:59
sorry to jump in here, but CI is something I'm interested in
Paolo Di Tommaso
@pditommaso
Jan 08 2015 21:59
Hello
I've discovered Circle CI recently
it works like a charm!
Michael L Heuer
@heuermh
Jan 08 2015 22:00
we're using Travis CI; is there something similar to a nextflow "compile" command, so that we don't need to run a whole pipeline (even on test data for us this would require a bunch of third party tools to be installed) for CI?
Paolo Di Tommaso
@pditommaso
Jan 08 2015 22:01
ok, you mean a kind of dry-run
right?
Michael L Heuer
@heuermh
Jan 08 2015 22:02
yeah just something that returns 0 if the pipeline script is ok
Paolo Di Tommaso
@pditommaso
Jan 08 2015 22:02
currently not
Michael L Heuer
@heuermh
Jan 08 2015 22:02
maybe I could just write a gradle build script?
Andrew Stewart
@andrewcstewart
Jan 08 2015 22:03
heuermh, would a docker image help you here?
Paolo Di Tommaso
@pditommaso
Jan 08 2015 22:04
yes, definitely
the main problem is that nextflow is a scripting lang
Michael L Heuer
@heuermh
Jan 08 2015 22:04
still trying to figure out how we might use docker . . . we would need nextflow + slurm cluster + dozens of bioinformatics applications + a bunch of reference data
from what I understand docker support in nextflow is for running processes in docker images, right?
Paolo Di Tommaso
@pditommaso
Jan 08 2015 22:05
yes
in the end, it's enough to create a container with all the tools you use
In general I think it's a very good practice not only for testing
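a sketch of the config (image name hypothetical):

    // nextflow.config
    process.container = 'my-org/pipeline-tools'   // one container with all the tools

then launch with `nextflow run main.nf -with-docker` and each process command runs inside that container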
Michael L Heuer
@heuermh
Jan 08 2015 22:06
I think we might need at least two, since one is the slurm master and the other would be a slurm slave
Paolo Di Tommaso
@pditommaso
Jan 08 2015 22:06
but also to distribute/publish it
you don't need to include slurm in the container
Michael L Heuer
@heuermh
Jan 08 2015 22:07
brad chapman has done some good things with docker (http://bcb.io/2014/12/19/awsbench/) I just need to begin to understand it :)
Paolo Di Tommaso
@pditommaso
Jan 08 2015 22:07
yes, that post is awesome
Andrew Stewart
@andrewcstewart
Jan 08 2015 22:08
I tend to notice folks look to Docker to handle the auto-build of everything.. and it most certainly can.. but there are also other tools out there that might be appropriate
(when it comes to things like slurm)
but Docker's role .. at least in the context of pipelines .. is really best applied to containing process runtime environments
if you want docker to also contain your execution service (SGE, SLURM, etc)... then you're starting to talk about Docker-in-Docker and things like that
Michael L Heuer
@heuermh
Jan 08 2015 22:09
ok simplified example, say we have the following pieces: bwa, GRCh38 fasta & bwa indices, slurm, nextflow, FASTQ reads. I'm not sure what should go where, in terms of containers
Andrew Stewart
@andrewcstewart
Jan 08 2015 22:10
I have some experience here if you're interested in talking through ideas
ah
Michael L Heuer
@heuermh
Jan 08 2015 22:10
sure I'm all ears
Andrew Stewart
@andrewcstewart
Jan 08 2015 22:10
I'm dealing with this exact issue right now
I would say in theory .. one docker image per 'tool'
Paolo Di Tommaso
@pditommaso
Jan 08 2015 22:11
I've managed to run rna-seq pipelines with Docker successfully
with a negligible overhead
Andrew Stewart
@andrewcstewart
Jan 08 2015 22:11
the reasoning being maintainability and other runtime considerations, like allocating resources (memory, cpu, etc) in large distributed environments
but I think as a starting point creating a single docker image is fine
(and then separating things out as necessary)
Michael L Heuer
@heuermh
Jan 08 2015 22:12
currently we have a number of EC2 images built on cloudbiolinux (so tools like bwa) with slurm & nextflow installed, reference data on NFS, FASTQ reads coming from S3, nextflow output to NFS
so we package tools like bwa in docker then, and leave nextflow & slurm at the AMI level?
Paolo Di Tommaso
@pditommaso
Jan 08 2015 22:13
yes
Andrew Stewart
@andrewcstewart
Jan 08 2015 22:13
aye
cloudbiolinux is, in this worldview, lots of bloat
Paolo Di Tommaso
@pditommaso
Jan 08 2015 22:13
nextflow will schedule the jobs
and each job runs in its own container
the good thing is that you don't have to take care of tool deployment on the EC2 nodes
Andrew Stewart
@andrewcstewart
Jan 08 2015 22:14
for the purpose of CI you can also probably leave out slurm for now
Paolo Di Tommaso
@pditommaso
Jan 08 2015 22:15
yes, I agree
Andrew Stewart
@andrewcstewart
Jan 08 2015 22:15
since nextflow just handles that as an executor
Michael L Heuer
@heuermh
Jan 08 2015 22:16
cbl for us is just for unfu-cking everything Heng Li does wrong with builds & dependencies :)
Paolo Di Tommaso
@pditommaso
Jan 08 2015 22:16
:)
uh? what?
Andrew Stewart
@andrewcstewart
Jan 08 2015 22:16
haha
there are a ton of existing docker images out there for bioinformatics
I can point you to some of the ones I've been using
(and writing your own is trivial)
Michael L Heuer
@heuermh
Jan 08 2015 22:17
ok so nextflow and slurm in my AMI, and install tools into a docker image. nextflow starts up the docker images then, via submitting jobs to slurm?
Andrew Stewart
@andrewcstewart
Jan 08 2015 22:17
yes
the order of operations is then: nextflow submits the process to slurm, and the submitted job spins up a docker container
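a sketch of the corresponding nextflow.config (image name hypothetical):

    process {
        executor  = 'slurm'                    // nextflow submits each task as a slurm job
        container = 'my-org/pipeline-tools'    // the job then runs the task in this container
    }

run with -with-docker and nextflow generates the docker invocation inside each submitted job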
Michael L Heuer
@heuermh
Jan 08 2015 22:19
ok, then it would seem that I would want the processes defined in my nextflow script to be rather coarse-grained, so as not to incur too much overhead
Paolo Di Tommaso
@pditommaso
Jan 08 2015 22:19
yes, that makes sense
Andrew Stewart
@andrewcstewart
Jan 08 2015 22:21
in case you are curious, it is indeed possible to run 'docker within docker'
Michael L Heuer
@heuermh
Jan 08 2015 22:22
in addition to cbl we also install more stuff via linuxbrew and some more stuff via an internal debian repository and do more stuff via puppet after the EC2 images come up, so getting that into a docker container might not be that easy; worth a try though
Andrew Stewart
@andrewcstewart
Jan 08 2015 22:22
if you wanted to (in production) manage slurm via docker
but I wouldn't necessarily recommend that route. More interesting, I think, is privileging a docker container (running, say, slurm) to interact with the host docker daemon and launch other containers in parallel
there's a couple of ways to do that.
Michael L Heuer
@heuermh
Jan 08 2015 22:24
I'm not sure what that means but I will be sending a transcript of this chat to our "devops" guy later this afternoon :)
Andrew Stewart
@andrewcstewart
Jan 08 2015 22:24
ha
it's the difference between having a 'parent' docker container running 'child' docker containers, and a 'first sibling' docker container running 'sibling' containers
as for the rest of your dependencies..
remember, you can choose to either build your docker image on the fly or simply pull a pre-built docker image
Paolo Di Tommaso
@pditommaso
Jan 08 2015 22:27
Also Shaun Jackman is distributing linuxbrew as a docker container
Andrew Stewart
@andrewcstewart
Jan 08 2015 22:27
so you could manually build your master docker image with bwa + friends, as well as linuxbrew and internal repo dependencies
Andrew Stewart
@andrewcstewart
Jan 08 2015 22:28
and then push that docker image up to docker hub (or your own private docker registry hosted on S3)
Michael L Heuer
@heuermh
Jan 08 2015 22:29
yeah I imagine we would want to prebuild the docker image for reproducibility; we're using "containers" at the AMI level for that at this point
Andrew Stewart
@andrewcstewart
Jan 08 2015 22:30
the alternative is to author a Dockerfile that does all of that stuff
Michael L Heuer
@heuermh
Jan 08 2015 22:38
thanks, looks like I have a lot to read
jasonbrelsford
@jasonbrelsford
Jan 08 2015 22:42
Hey all.
Paolo Di Tommaso
@pditommaso
Jan 08 2015 22:43
Hi
Michael L Heuer
@heuermh
Jan 08 2015 22:43
@jasonbrelsford can you scroll back? all the answers to our problems are here :)
Paolo Di Tommaso
@pditommaso
Jan 08 2015 22:43
ahahah
no, they are simply new problems ;)
jasonbrelsford
@jasonbrelsford
Jan 08 2015 22:50
ah yes. I'm seeing this and have some testing to do.
Jeremy Anderson
@andersje
Jan 08 2015 22:53
woot, day late and a dollar short, but I'm here :)
sorry, walked to the library with the youngest to pick up some books.
Jeremy Anderson
@andersje
Jan 08 2015 23:03
heuermh: how exactly are we using containers already?
Michael L Heuer
@heuermh
Jan 08 2015 23:05
we're not exactly, right? you might want to scroll back to catch up
now that we've taken over this chat room :)
Jeremy Anderson
@andersje
Jan 08 2015 23:07
I did, but what threw me was trying to associate "containers" with anything we do -- which is just cloning an instance from an AMI, and then enforcing package installation from an apt mirror, using puppet.
because I thought container was a pretty docker-specific term.
Michael L Heuer
@heuermh
Jan 08 2015 23:08
that's why containers was in quotes . . . our container per se is the AMI
Jeremy Anderson
@andersje
Jan 08 2015 23:08
right.
it also looks like we'll have to build OS-specific containers.
just as we already build OS-specific .debs
Michael L Heuer
@heuermh
Jan 08 2015 23:10
if I have this right, what we might want to try to do is 1) create an AMI with only slurm, nextflow, and the docker host 2) configure /mnt/common appropriately 3) create a docker image with bwa, samtools, ngs-tools, etc. in it and 4) use nextflow to spin up docker images through slurm to run the processes in the nextflow pipeline
Jeremy Anderson
@andersje
Jan 08 2015 23:11
that sounds about right.
as i understand it at least.
Michael L Heuer
@heuermh
Jan 08 2015 23:11
oh ok, let's do it then :)
Jeremy Anderson
@andersje
Jan 08 2015 23:11
sure thing, after we build the new puppet infrastructure for dash 2.0
Michael L Heuer
@heuermh
Jan 08 2015 23:12
I assume that works off the linuxbrew docker image linked above, right?
Jeremy Anderson
@andersje
Jan 08 2015 23:12
#1 will be easy, since we can leverage puppet to do all those things.
it could
Michael L Heuer
@heuermh
Jan 08 2015 23:12
sry, that was a joke
Paolo Di Tommaso
@pditommaso
Jan 08 2015 23:12
Even though nextflow can pull the container automatically, I would suggest pulling the container on the nodes (e.g. docker pull) before launching the pipeline execution
Jeremy Anderson
@andersje
Jan 08 2015 23:13
right now, jason is just starting to spin up the new puppet server. Then we need to build an AMI that can ssh to the puppet server and stick its own public IP address into a file, so that the puppet server can add that public IP to the security group, allowing it to talk to the puppet daemon.
yeah, I was figuring we'd push all the relevant containers to nodes via puppet as well. Then the pipeline can just start whichever containers are necessary.
Paolo Di Tommaso
@pditommaso
Jan 08 2015 23:14
it sounds good.
Jeremy Anderson
@andersje
Jan 08 2015 23:14
I still think we're better off limiting the number of available pieces of software we put on our nodes to just the stuff we need.
Michael L Heuer
@heuermh
Jan 08 2015 23:17
yeah but that list is still quite long, especially when I add the variant calling and annotation parts
Jeremy Anderson
@andersje
Jan 08 2015 23:17
:(
Andrew Stewart
@andrewcstewart
Jan 08 2015 23:21
btw.. it's probably good to differentiate between docker images and containers
but basically... docker images : EC2 AMIs :: docker containers : EC2 instances
Jeremy Anderson
@andersje
Jan 08 2015 23:21
Is there a quick glossary of all these things?
Andrew Stewart
@andrewcstewart
[posted a graphic illustrating Docker images vs. containers]
Jan 08 2015 23:21
Jeremy Anderson
@andersje
Jan 08 2015 23:21
oh, awesome.
Andrew Stewart
@andrewcstewart
Jan 08 2015 23:22
I throw that graphic into every slide deck nowadays
@pditommaso quick question: I see both $foobar and ${foobar} in process blocks. Is the difference that $foobar is a val and ${foobar} is getting the name of a file?
Jeremy Anderson
@andersje
Jan 08 2015 23:24
that's good. I don't know what webpage I was reading before, but docker.com is very straight forward. I don't know why I didn't realize it was just equivalent to a zone before.
Michael L Heuer
@heuermh
Jan 08 2015 23:24
sweet, thanks. you do any consulting? ;)
Andrew Stewart
@andrewcstewart
Jan 08 2015 23:28
Sometimes
Michael L Heuer
@heuermh
Jan 08 2015 23:30
well let me just say that the weather here in Minnesota, USA right now is quite nice if you're up for travel
Jeremy Anderson
@andersje
Jan 08 2015 23:31
it's lovely, really.
Andrew Stewart
@andrewcstewart
Jan 08 2015 23:31
San Francisco :)
Michael L Heuer
@heuermh
Jan 08 2015 23:31
Winter Weather Advisory is in effect until January 9, 12:00 AM CST
Wind Chill Advisory in effect from January 9, 12:00 AM CST until January 9, 12:00 PM CST
Hazardous Weather Outlook is in effect
windchills of -50F yesterday
SF would be lovely
Jeremy Anderson
@andersje
Jan 08 2015 23:33
on the upside, the weather here keeps the vast majority of the riff raff out.
Michael L Heuer
@heuermh
Jan 08 2015 23:35
that must be why I'm itching to leave
Andrew Stewart
@andrewcstewart
Jan 08 2015 23:36
what's your org?
Michael L Heuer
@heuermh
Jan 08 2015 23:37
@andersje and I consult to Be The Match/NMDP http://bethematch.org/ ; @jasonbrelsford is an employee
Andrew Stewart
@andrewcstewart
Jan 08 2015 23:39
oh how cool
I was working with an organization this past fall that partners with various nonprofits and one of their projects was matching blood donors
Michael L Heuer
@heuermh
Jan 08 2015 23:42
we're interested in those parts of the genome that most groups have in their exclusion BED file (i.e. the MHC on chr6 and KIR on chr19) for matching; recruitment is also a big r&d concern
thank you @pditommaso and @andrewcstewart for your time; I need to get bundled up for the walk home from the coffee shop
Andrew Stewart
@andrewcstewart
Jan 08 2015 23:46
You're welcome
Jeremy Anderson
@andersje
Jan 08 2015 23:58
yes, thank you for clearing up my misconceptions :)
so this will be archived...is there an easy way to copy the transcript, complete with usernames?