These are chat archives for nextflow-io/nextflow

25th Jan 2017
Fredrik Boulund
@boulund
Jan 25 2017 11:59
What is the best practice if I want a fairly large reference database split across several files transferred to the node scratch dir with the scratch = true and stageInMode = 'copy' settings active? Should I put it in a tarball and declare the tarball an input file, then unpack the tarball as part of the process script, or is there a better way built into nextflow?
Paolo Di Tommaso
@pditommaso
Jan 25 2017 12:01
I would use the default stage-in, i.e. symlinks
Fredrik Boulund
@boulund
Jan 25 2017 12:03
and just declare each and every one of the 44 db components in the process input declaration?
The thing is, the process is going to do a lot of random access to the DB, so I'd prefer to have it locally on each node, to relieve the shared file system of stress if possible
Paolo Di Tommaso
@pditommaso
Jan 25 2017 12:05
if you think they are going to be heavily randomly accessed, it may make sense to copy them locally
Fredrik Boulund
@boulund
Jan 25 2017 12:06
but still the best way is to list all of the files in the input declarations? is there perhaps a way to specify the list of files as a list in a config file somewhere, and have the entire list copied over to scratch?
Paolo Di Tommaso
@pditommaso
Jan 25 2017 12:06
if they have a common file name pattern, you can use a file input declaration using a glob pattern
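A minimal sketch of the glob approach, written in the same process syntax used elsewhere in this chat. The path `/your/db/dir/*.idx` and the `.idx` extension are hypothetical placeholders, not taken from the discussion. `Channel.fromPath` expands the glob and `toList()` gathers the matches, so the process receives all of them in a single task.

```groovy
// Hypothetical location and extension; adjust to the real database files.
db_files = Channel.fromPath('/your/db/dir/*.idx').toList()

process foo {
  input:
  file db from db_files   // every matching file is staged into the task dir

  """
  ls -l $db
  """
}
```

Note that this pattern only matches files directly inside the directory, so it suits a flat layout better than a nested one.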
Fredrik Boulund
@boulund
Jan 25 2017 12:07
it's a fairly complex structure unfortunately, some subdirs, and lots of files
Paolo Di Tommaso
@pditommaso
Jan 25 2017 12:08
you can still provide a list as the from argument
I mean a channel providing a list of files
Fredrik Boulund
@boulund
Jan 25 2017 12:16
I'm not sure I understand what you mean.
Paolo Di Tommaso
@pditommaso
Jan 25 2017 12:18
something like
x = Channel.from('a','b','c').map { file(it) } .toList()

process foo {
  input:
  file all from x
  """
  echo $all
  """
}
Fredrik Boulund
@boulund
Jan 25 2017 12:19
Ok, then that would stage in files a, b, and c to the scratch dir, right?
Paolo Di Tommaso
@pditommaso
Jan 25 2017 12:19
yes
Fredrik Boulund
@boulund
Jan 25 2017 12:20
This sounds like a viable solution
Paolo Di Tommaso
@pditommaso
Jan 25 2017 12:20
:+1:
Fredrik Boulund
@boulund
Jan 25 2017 12:41
one more thing sprang to mind: will this method actually preserve directory structure?
probably not, right?
Paolo Di Tommaso
@pditommaso
Jan 25 2017 12:42
nope
Fredrik Boulund
@boulund
Jan 25 2017 12:43
I guess I'll make a list per subdir then, and recreate the dir structure in the scratch dir in the process description
Paolo Di Tommaso
@pditommaso
Jan 25 2017 12:44
but you can just reference the top dir in this case and it will copy all the content
much easier!
Fredrik Boulund
@boulund
Jan 25 2017 12:44
ah, you can??
Paolo Di Tommaso
@pditommaso
Jan 25 2017 12:44
yep
Fredrik Boulund
@boulund
Jan 25 2017 12:44
yeah, definitely much easier!
do I then make a channel of the base dir ?
Paolo Di Tommaso
@pditommaso
Jan 25 2017 12:46
as easy as that
process foo {
  input:
  file dir from file('/your/db/dir')

  """
  echo $dir
  """
}
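Combined with the settings from the original question, a sketch might look like the following. It assumes `scratch` and `stageInMode` are set as process directives (they could equally be set in the config file), and `/your/db/dir` is the same placeholder path as in the snippet above.

```groovy
process foo {
  scratch true          // run the task in the node-local scratch directory
  stageInMode 'copy'    // copy inputs instead of symlinking them

  input:
  file dir from file('/your/db/dir')   // placeholder path to the DB top dir

  """
  ls -R $dir
  """
}
```

Listing the directory recursively is just a quick way to check that the subdirectory structure arrived intact.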
Fredrik Boulund
@boulund
Jan 25 2017 12:47
nice!
thanks a lot for your help, really appreciate it!
Paolo Di Tommaso
@pditommaso
Jan 25 2017 12:52
welcome!
Félix C. Morency
@fmorency
Jan 25 2017 16:05
oooh that's coool
Paolo Di Tommaso
@pditommaso
Jan 25 2017 16:05
what?
Félix C. Morency
@fmorency
Jan 25 2017 16:06
that if you reference a file('/path/to/dir') it will copy all the content
Paolo Di Tommaso
@pditommaso
Jan 25 2017 16:06
ok, sure
Anthony Underwood
@aunderwo
Jan 25 2017 16:34
Hi. I've been in touch before about running nextflow as a workflow manager for pipelines in a large routine microbiology setting.
@pditommaso I believe you've been invited by the organising committee of the Applied Bioinformatics and Public Health Microbiology conference to give a talk. Very much looking forward to hearing you, since reproducible workflows in microbiology in the UK and across Europe are becoming an issue
One question I'm interested in is whether nextflow will be CWL compatible since this seems to be a growing standard
Anthony Underwood
@aunderwo
Jan 25 2017 16:40
Also read with interest your blog post on Singularity. How do you think this fits with cloud computing (e.g. OpenStack) as opposed to a bare metal cluster?
Anthony Underwood
@aunderwo
Jan 25 2017 16:57
I've tried Nextflow on the CLI and our UGE cluster and it works beautifully :)
Paolo Di Tommaso
@pditommaso
Jan 25 2017 18:10
@aunderwo Hi Anthony, I'm going to give a talk about reproducibility and NF at that conference. I will be happy to meet you there.
Regarding CWL, there are some ideas to support it but unfortunately we don't have a schedule yet. We are an academic project, so we have to focus on the main priorities for our research.
Paolo Di Tommaso
@pditommaso
Jan 25 2017 18:19
Singularity surely works better in the context of HPC clusters commonly available to research institutes, for the reasons I've explained in the blog post.
It can also be used in public clouds, but in that context I don't see any particular advantage over Docker.
Paolo Di Tommaso
@pditommaso
Jan 25 2017 18:24
Then, if you are referring to an OpenStack private cloud, it depends on your specific requirements. It must be noted that Singularity is developing a hub with functionality very similar to Docker's