These are chat archives for nextflow-io/nextflow

2nd
Nov 2016
Denis Moreno
@Galithil
Nov 02 2016 09:14
Hello again @pditommaso , I am trying to understand what path is mounted by default by Nextflow when using a Docker container image. I saw that I could use temp or runOptions to pass my options to docker, but I would like to know what auto does exactly.
Paolo Di Tommaso
@pditommaso
Nov 02 2016 09:47
the option temp allows you to specify a host directory to mount as /tmp in the container
when using temp = 'auto' or temp = true (they are the same) it creates a unique temp directory for you
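For reference, the temp setting described above lives in the docker scope of the configuration; a minimal sketch (the 'auto' value comes from the discussion, the surrounding file is illustrative):

```groovy
// nextflow.config -- illustrative fragment
docker {
  enabled = true
  // mount a fresh, unique host directory as /tmp in the container;
  // temp = true behaves the same way
  temp = 'auto'
}
```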
Denis Moreno
@Galithil
Nov 02 2016 10:13
Isn't that how the input files and the output files are handled?
I thought it was mounting the current working directory
Paolo Di Tommaso
@pditommaso
Nov 02 2016 10:22
It mounts the current working dir + any other path needed to access the process input files
mounting the temp directory is generally not needed; you can mount a host folder as /tmp if a specific app uses it heavily
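If an extra host path really does need to be visible inside the container, it can be forwarded through runOptions; a sketch, where the /data/genomes path is a made-up example:

```groovy
// nextflow.config -- hypothetical path, shown only to illustrate runOptions
docker {
  enabled = true
  // passed verbatim to `docker run`
  runOptions = '-v /data/genomes:/data/genomes:ro'
}
```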
Denis Moreno
@Galithil
Nov 02 2016 10:26
I see
I'll probably need to mount the path of my reference genomes
sounds good, thanks
Paolo Di Tommaso
@pditommaso
Nov 02 2016 10:27
you should not need that
isn't the ref genome declared as an input file?
Denis Moreno
@Galithil
Nov 02 2016 10:40
I don't know that much, sorry, I'm just a lowly dev
asking @ewels
Paolo Di Tommaso
@pditommaso
Nov 02 2016 10:41
in principle (and also in practice) you only need to specify what container to use, you don't need to mount extra paths in the container
Phil Ewels
@ewels
Nov 02 2016 10:45
Yes, the ref genome is declared as an input in the process: https://github.com/SciLifeLab/NGI-RNAseq/blob/master/main.nf#L211
Paolo Di Tommaso
@pditommaso
Nov 02 2016 10:46
good, thus the container use should be transparent
Denis Moreno
@Galithil
Nov 02 2016 10:47
okay then. Thanks.
Phil Ewels
@ewels
Nov 02 2016 10:47
:tada:
Paolo Di Tommaso
@pditommaso
Nov 02 2016 10:48
@ewels one question, have you ever met this problem nextflow-io/nextflow#234 ?
in a nutshell, in order to free resources for jobs with higher priority, SLURM may suspend some jobs, killing them and restarting them later
Phil Ewels
@ewels
Nov 02 2016 10:50
Wow, that's horrible
I'm not aware of our cluster doing that, but that's not to say that it doesn't happen
Paolo Di Tommaso
@pditommaso
Nov 02 2016 10:50
I agree .. :/
Denis Moreno
@Galithil
Nov 02 2016 10:50
Well, we don't really handle priorities that way
Paolo Di Tommaso
@pditommaso
Nov 02 2016 10:51
I'm asking that because I've noticed you are handling 143 error code
Phil Ewels
@ewels
Nov 02 2016 10:51
Yup - I asked our sysadmins about the different error codes and they were not super clear about exactly what they meant
We tried to test it out manually by setting ridiculously small requirements and came to the conclusion that 143 was what we got when jobs ran out of memory or time
Paolo Di Tommaso
@pditommaso
Nov 02 2016 10:53
anyhow I guess when a task uses too many resources it's killed with 143
Phil Ewels
@ewels
Nov 02 2016 10:53
The resubmission thing seems to work in our hands
Yup, exactly
Paolo Di Tommaso
@pditommaso
Nov 02 2016 10:53
ok, yes
thanks
Phil Ewels
@ewels
Nov 02 2016 10:54
I wasn't aware of the pending / resubmission behaviour though.
It's possible that it happens very occasionally and could be behind some of the odd "didn't work, reran and it worked" runs that you get every now and then
Paolo Di Tommaso
@pditommaso
Nov 02 2016 10:54
neither was I
also it seems that the person that opened that issue was heavily inspired by your pipeline
Phil Ewels
@ewels
Nov 02 2016 10:56
We will try to keep an eye out for failed runs which fit this pattern, though to be honest our job priorities are typically pretty static, so I don't think it will happen much / at all
haha, yes I thought it looked a bit familiar.
Paolo Di Tommaso
@pditommaso
Nov 02 2016 10:57
:)
Phil Ewels
@ewels
Nov 02 2016 10:57
I think we took the retry thing from your docs though didn't we?
It was a while ago we wrote that..
Paolo Di Tommaso
@pditommaso
Nov 02 2016 10:58
yes but the use of
  cpus { 1 } 
  maxErrors '-1'
makes me think it's inspired by your code
BTW you can use cpus 1 instead of cpus { 1 }
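The difference between the two forms: a closure is re-evaluated for every task (useful with dynamic values such as task.attempt), while a bare value is static. A sketch with an illustrative process name:

```groovy
process align {
  cpus 1                           // static value, evaluated once
  memory { 2.GB * task.attempt }   // closure, re-evaluated on each retry
  """
  echo run
  """
}
```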
Phil Ewels
@ewels
Nov 02 2016 10:59
Yup. The quoted maxErrors thing was from a discussion with you as I remember
Paolo Di Tommaso
@pditommaso
Nov 02 2016 11:00
also you could externalise all the declarations in a config file
Phil Ewels
@ewels
Nov 02 2016 11:00
Aha, detective @Galithil has just discovered that our SLURM cluster doesn't do preemption, we think :shipit:
Paolo Di Tommaso
@pditommaso
Nov 02 2016 11:00
good
Phil Ewels
@ewels
Nov 02 2016 11:01
Externalise? You mean put it in a function in the config file or something?
Paolo Di Tommaso
@pditommaso
Nov 02 2016 11:01
yes
Phil Ewels
@ewels
Nov 02 2016 11:01
(apologies, I'm a self-confessed groovy / java n00b)
Paolo Di Tommaso
@pditommaso
Nov 02 2016 11:01
:)
Phil Ewels
@ewels
Nov 02 2016 11:01
ok, that would be nice
Denis Moreno
@Galithil
Nov 02 2016 11:02
confirmed
Paolo Di Tommaso
@pditommaso
Nov 02 2016 11:02
for example that would avoid repeating errorStrategy { task.exitStatus == 143 ? 'retry' : 'terminate' } for each process
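Moving the shared directives into the configuration file might look like this sketch (the 143 check is taken from the discussion above; the maxRetries value is illustrative):

```groovy
// nextflow.config -- applies to every process, so the directive
// no longer needs to be repeated in each process definition
process {
  errorStrategy = { task.exitStatus == 143 ? 'retry' : 'terminate' }
  maxRetries = 3   // illustrative value
}
```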
Phil Ewels
@ewels
Nov 02 2016 11:02
+1 for removing repetitive code
Paolo Di Tommaso
@pditommaso
Nov 02 2016 11:03
I can send you a PR to show you how it works
Phil Ewels
@ewels
Nov 02 2016 11:03
haha, perfect - I was about to start asking questions but that would be much quicker
super helpful, thank you!
Paolo Di Tommaso
@pditommaso
Nov 02 2016 11:03
:+1:
Phil Ewels
@ewels
Nov 02 2016 11:03
Whilst we're chatting - do you know what happens with environment module load statements when there is no env module system? Are they just ignored?
Paolo Di Tommaso
@pditommaso
Nov 02 2016 11:04
we made NF for exactly this, enabling sharing !
Phil Ewels
@ewels
Nov 02 2016 11:04
Thinking about the docker work that @Galithil is talking above
Absolutely! Our RNA pipeline itself was initially inspired by someone else's STAR pipeline
Paolo Di Tommaso
@pditommaso
Nov 02 2016 11:05
> Whilst we're chatting - do you know what happens with environment module load statements when there is no env module system? Are they just ignored?
No, it would cause an error
Denis Moreno
@Galithil
Nov 02 2016 11:05
ha.
Paolo Di Tommaso
@pditommaso
Nov 02 2016 11:05
but you can move also that in a config file
Denis Moreno
@Galithil
Nov 02 2016 11:05
+1 for sane behaviour
Paolo Di Tommaso
@pditommaso
Nov 02 2016 11:06
you can even manage different config profiles, eg one using module and the other for Docker
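Config profiles as described could be sketched like this (the profile names, module string, and container name are all made up for illustration):

```groovy
// nextflow.config -- two hypothetical profiles, selected with `-profile <name>`
profiles {
  modules {
    process.module = 'multiqc/0.8'            // cluster with environment modules
  }
  docker {
    docker.enabled = true
    process.container = 'scilifelab/ngi-rnaseq'  // hypothetical image name
  }
}
```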
Phil Ewels
@ewels
Nov 02 2016 11:07
that's exactly what we want :+1:
Also being able to specify different module names, so that people on different clusters can use the pipeline
eg. one cluster calls it MultiQC, the other calls it multiqc, etc.
Paolo Di Tommaso
@pditommaso
Nov 02 2016 11:07
exactly
which version can I use for the PR
or
?
Paolo Di Tommaso
@pditommaso
Nov 02 2016 11:08
great
Phil Ewels
@ewels
Nov 02 2016 11:08
I've done that by accident a few times, should have created it there first and forked it so that it's obvious which is the head repo :(
Paolo Di Tommaso
@pditommaso
Nov 02 2016 11:09
:)
Phil Ewels
@ewels
Nov 02 2016 11:15
Any best practice recommendations for reference genomes with docker?
Seems like a bad idea to include them in the image because of huge filesizes
Paolo Di Tommaso
@pditommaso
Nov 02 2016 11:15
yes, I never include data in the docker image
Phil Ewels
@ewels
Nov 02 2016 11:16
so just leave it up to the user to provide alongside data?
that's pretty fair I guess
Paolo Di Tommaso
@pditommaso
Nov 02 2016 11:16
yes
my best practice is to include in the github repo a small test dataset
so you can download and run it
Denis Moreno
@Galithil
Nov 02 2016 11:17
you still need the reference genomes, though
Phil Ewels
@ewels
Nov 02 2016 11:17
:+1:
Paolo Di Tommaso
@pditommaso
Nov 02 2016 11:17
that's very useful to test your workflows with Travis or CircleCI
> you still need the reference genomes, though
you can create a small fake one
Denis Moreno
@Galithil
Nov 02 2016 11:19
Probably, but @ewels would have to do it ;)
Rickard Hammarén
@Hammarn
Nov 02 2016 11:21
didn't you make your own STAR index in a few seconds? @Galithil
Denis Moreno
@Galithil
Nov 02 2016 11:23
I still have no idea what I did, but apparently, yep.
Phil Ewels
@ewels
Nov 02 2016 14:13
@pditommaso - one thing that bugs me with Nextflow and that I / others frequently get wrong when launching our pipeline is the --reads '*_R{1,2}.fastq.gz' syntax
Specifically, having to wrap the filename in quotes
But also messing up the paired end matching sometimes
My other pipeline tool has a builtin function to do the same job of pairing up input files which it does in a different way
Essentially it lines up the input filenames alphabetically and strips out _R?[1-4] from the filename. If there are then pairs with identical names it returns them in a pair. Otherwise things come back as single end
It's a stronger assumption that can obviously be wrong sometimes, but it works smoothly 99% of the time and means that you can be a lot lazier when launching the pipeline as you can just do --reads *fastq.gz
Do you think it would be possible for us to write something similar in Nextflow?
And is there any reason why we definitely shouldn't try to? ;)
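For the record, later Nextflow releases added a built-in channel factory for exactly this pairing job; a sketch of how the quoted --reads pattern maps onto it (requires a version that provides Channel.fromFilePairs):

```groovy
// each emitted item is [ sampleId, [read1, read2] ]
Channel
    .fromFilePairs('*_R{1,2}.fastq.gz')
    .subscribe { id, reads -> println "$id -> $reads" }
```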
Félix C. Morency
@fmorency
Nov 02 2016 15:27
Stupid question: how can I do something like val min, val max from 0.1, 0.9
Johan Viklund
@viklund
Nov 02 2016 15:53
@pditommaso looked through the PR for the RNAseq pipeline, why are there $ signs in the profile module specification?
Paolo Di Tommaso
@pditommaso
Nov 02 2016 20:48
@fmorency Like the following
process foo {
  input: 
  set val(x), val(y) from Channel.value([1,2]) 

  """
  echo $x $y
  """
}
Félix C. Morency
@fmorency
Nov 02 2016 20:48
i wasn't far :P
Paolo Di Tommaso
@pditommaso
Nov 02 2016 20:48
:)
Félix C. Morency
@fmorency
Nov 02 2016 20:49
thanks
Paolo Di Tommaso
@pditommaso
Nov 02 2016 20:49
:ok_hand:
@ewels I found that annoying as well, but the quotes are needed because otherwise the glob is expanded by the bash interpreter. Anyhow please open a feature request and continue the discussion there.
Phil Ewels
@ewels
Nov 02 2016 20:53
Yes, we found the same thing. I'll open an issue - thanks!
Paolo Di Tommaso
@pditommaso
Nov 02 2016 21:17
@viklund The $ is needed to reference a process name in place of a directive name. See here.
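The $ prefix selects a specific process inside the process configuration scope (syntax as of 2016-era Nextflow; later versions replaced it with withName). A sketch with made-up process and module names:

```groovy
// nextflow.config -- `star` is a hypothetical process name
process {
  module = 'multiqc/0.8'        // default module for all processes
  $star.module = 'star/2.5.2b'  // overrides the module for the `star` process only
}
```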