These are chat archives for nextflow-io/nextflow

7th
Nov 2016
Phil Ewels
@ewels
Nov 07 2016 12:33
Hi @pditommaso! @Galithil and I are trying to port our pipeline to Docker, as you know - it's mostly working now, apart from the final step. This process: https://github.com/SciLifeLab/NGI-RNAseq/blob/master/main.nf#L650-L677
In our docker image, we're getting results folders that contain a single file, which has multiple filenames printed in it:
$ cat dupradar/input.6 

[/Users/denismoreno/Documents/nextflow_test/work/ec/383990f26be96a94e119b70aa5619d/ERR458837_testAligned.sortedByCoord.out.bam.markDups.bam_dupMatrix.txt, /Users/denismoreno/Documents/nextflow_test/work/ec/383990f26be96a94e119b70aa5619d/ERR458837_testAligned.sortedByCoord.out.bam.markDups.bam_duprateExpBoxplot.pdf, /Users/denismoreno/Documents/nextflow_test/work/ec/383990f26be96a94e119b70aa5619d/ERR458837_testAligned.sortedByCoord.out.bam.markDups.bam_duprateExpDens.pdf, /Users/denismoreno/Documents/nextflow_test/work/ec/383990f26be96a94e119b70aa5619d/ERR458837_testAligned.sortedByCoord.out.bam.markDups.bam_duprateExpDensCurve.txt, /Users/denismoreno/Documents/nextflow_test/work/ec/383990f26be96a94e119b70aa5619d/ERR458837_testAligned.sortedByCoord.out.bam.markDups.bam_expressionHist.pdf, /Users/denismoreno/Documents/nextflow_test/work/ec/383990f26be96a94e119b70aa5619d/ERR458837_testAligned.sortedByCoord.out.bam.markDups.bam_intercept_slope.txt]
Instead of just having the files themselves linked into that folder, as we'd expect
what's strange is that this seems to work fine when we run it normally on our cluster
any ideas what could be going on here?
Evan Floden
@evanfloden
Nov 07 2016 12:45
My first guess would be that the dupradar_results channel does not contain what is expected. It could be that a list of files is being passed to the input of the last process (so it arrives as a single list element) rather than a channel of individual files (which would each be collected into the list). Not sure how/why this would be Docker-specific though.
Phil Ewels
@ewels
Nov 07 2016 12:48
That's what's confusing me - I'm pretty sure that this works fine on our normal cluster
Denis Moreno
@Galithil
Nov 07 2016 13:01
found out what the issue was
our last step tries to open all the files in the CWD; one of them is a named pipe (.command.pe), so it hangs waiting for a chance to open it. I'm assuming this pipe is a nextflow thing?
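(you can reproduce the blocking behaviour in a bare shell, outside nextflow:

mkfifo test.pipe   # create a named pipe
cat test.pipe      # opening it for reading blocks until a writer opens the other end

so anything that blindly opens every file in the work dir will hang on the pipe)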
Paolo Di Tommaso
@pditommaso
Nov 07 2016 13:34
@Galithil Yes, all .command.* files are generated by nextflow. Is that the cause of the problem?
Paolo Di Tommaso
@pditommaso
Nov 07 2016 13:39
@ewels do you mean that the content of dupradar/input.6 is the following?
[/Users/denismoreno/Documents/nextflow_test/work/ec/383990f26be96a94e119b70aa5619d/ERR458837_testAligned.sortedByCoord.out.bam.markDups.bam_dupMatrix.txt, /Users/denismoreno/Documents/nextflow_test/work/ec/383990f26be96a94e119b70aa5619d/ERR458837_testAligned.sortedByCoord.out.bam.markDups.bam_duprateExpBoxplot.pdf, /Users/denismoreno/Documents/nextflow_test/work/ec/383990f26be96a94e119b70aa5619d/ERR458837_testAligned.sortedByCoord.out.bam.markDups.bam_duprateExpDens.pdf, /Users/denismoreno/Documents/nextflow_test/work/ec/383990f26be96a94e119b70aa5619d/ERR458837_testAligned.sortedByCoord.out.bam.markDups.bam_duprateExpDensCurve.txt, /Users/denismoreno/Documents/nextflow_test/work/ec/383990f26be96a94e119b70aa5619d/ERR458837_testAligned.sortedByCoord.out.bam.markDups.bam_expressionHist.pdf, /Users/denismoreno/Documents/nextflow_test/work/ec/383990f26be96a94e119b70aa5619d/ERR458837_testAligned.sortedByCoord.out.bam.markDups.bam_intercept_slope.txt]
weird!
Paolo Di Tommaso
@pditommaso
Nov 07 2016 13:50
@ewels ok, I think I've understood what's happening
The tricky point is this output declaration:
 file '*.{pdf,txt}' into dupradar_results
Since that captures two files, they are emitted as a pair, so dupradar_results produces a sequence of pairs.
As a consequence, dupradar_results.toList() returns a list of pairs instead of a list of files.
Replacing that with dupradar_results.flatten().toList() should work as expected.
I would suggest doing the same for all the inputs declared in the multiqc step.
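For example (hypothetical file names), with two samples you'd get something like:

dupradar_results.toList()            // [[a.pdf, a.txt], [b.pdf, b.txt]] - a list of pairs
dupradar_results.flatten().toList()  // [a.pdf, a.txt, b.pdf, b.txt]    - a flat list of files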
Phil Ewels
@ewels
Nov 07 2016 14:14
Hi @pditommaso, sorry - I was in a meeting. Thanks for the tip! I'm not sure that can be the only reason though, as we see the same thing for other processes, e.g. featurecounts:
[/Users/denismoreno/Documents/nextflow_test/work/9e/519c073e63f7fe033dd7292b7c957c/ERR458837_testLog.final.out, /Users/denismoreno/Documents/nextflow_test/work/9e/519c073e63f7fe033dd7292b7c957c/ERR458837_testLog.out, /Users/denismoreno/Documents/nextflow_test/work/9e/519c073e63f7fe033dd7292b7c957c/ERR458837_testLog.progress.out]root@ca5c1a233976:/Users/denismoreno/Documents/nextflow_test/work/0e/473081017
Though I guess that process does capture multiple files too, even though it doesn't have the curly brackets.
Any idea why this would happen in one environment (docker) and not another (our cluster)?
Also - the pipe thing is a separate problem, I think: open pipes are making MultiQC hang. Should be able to fix that in the MultiQC code fairly easily though.
Paolo Di Tommaso
@pditommaso
Nov 07 2016 14:23
@ewels I'm quite sure that you need a flatten there
Phil Ewels
@ewels
Nov 07 2016 14:23
ok, sounds good - I'll add that.
Paolo Di Tommaso
@pditommaso
Nov 07 2016 14:24
I'm surprised that it works when using the cluster though
you can see quite easily the difference using this snippet
process foo {
    input:
    each x from (['a','b','c'])

    output:
    file '*.{pdf,txt}' into dupradar_results   // two files per task, emitted together as a pair

    script:
    """
    touch ${x}.pdf
    touch ${x}.txt
    """
}

process bar {
    echo true

    input:
    file all from dupradar_results.flatten().toList()   // flatten first, then gather into one list

    script:
    """
    echo $all
    """
}
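Running that, bar should print all six files on one line, e.g. a.pdf a.txt b.pdf b.txt c.pdf c.txt (the order of the pairs may vary); without the flatten() the input is a list of pairs, which is what produced the bracketed name in dupradar/input.6 above.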
Phil Ewels
@ewels
Nov 07 2016 14:30
:+1:
Denis Moreno
@Galithil
Nov 07 2016 15:39
I'm mildly upset by the fact that nextflow has a -profile option. I like long option names to be prefixed with two dashes.
Paolo Di Tommaso
@pditommaso
Nov 07 2016 15:39
:)
me too
Stian Soiland-Reyes
@stain
Nov 07 2016 15:40
I'm guilty of that in Taverna as well.. we have -inputvalue :-(
Paolo Di Tommaso
@pditommaso
Nov 07 2016 15:41
let's set up the club of the single-dash-option guilty :)
anyhow, the rationale behind it is that -opt options are used by nextflow itself
while --opt options are user pipeline options
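for example (hypothetical pipeline params):

nextflow run foo -profile docker --genome GRCh37 --reads 'data/*.fq.gz'

here -profile is consumed by nextflow itself, while --genome and --reads become params.genome and params.reads inside the pipeline script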
there was an idea to switch to a GNU-compliant command line format
Stian Soiland-Reyes
@stain
Nov 07 2016 15:43
makes perfect sense :) Then any -- is free, right?
Denis Moreno
@Galithil
Nov 07 2016 15:43
I heard that. I still don't like it :)
Paolo Di Tommaso
@pditommaso
Nov 07 2016 15:44
yes, a possibility could be to use -- to separate NF options from user options, e.g.
nextflow run foo --profile bar -- --opt1 --opt2
but I find it a bit verbose
what do you think?
Stian Soiland-Reyes
@stain
Nov 07 2016 15:45
I think it earns more geeky points, but is perhaps less useful for a typical Nextflow user?
Denis Moreno
@Galithil
Nov 07 2016 15:45
As an end user, knowing where options go does not make much difference to me. I seem to be the only one though :P
Phil Ewels
@ewels
Nov 07 2016 15:46
I quite like the current setup, it makes sense when you're writing the pipelines at least
I understand how it's confusing for those who are just running them though
Not sure that the -- separator would help that situation though
Maxime Garcia
@MaxUlysse
Nov 07 2016 15:47
same thing as @ewels , I think the -- would not help at all
Denis Moreno
@Galithil
Nov 07 2016 15:47
I'm partial to having the options that apply to nextflow straight after nextflow, and all the user options after foo
but positioning usually ends in headaches as well
Paolo Di Tommaso
@pditommaso
Nov 07 2016 15:48
but positioning usually ends in headaches as well
yes, very annoying - you find yourself having to jump back and forth along the command line
Evan Floden
@evanfloden
Nov 07 2016 15:49
Yeah I agree with @ewels @MaxUlysse, location dependent flags/parameters are not cool.
Phil Ewels
@ewels
Nov 07 2016 15:49
How about just having everything use two dashes? Parse core options first, then whatever is left is sent to params? Would mean that you couldn't have params.c, but anyone who writes a pipeline with variable names identical to core nextflow parameters deserves to feel guilty anyway
Paolo Di Tommaso
@pditommaso
Nov 07 2016 15:50
um, then when adding a new NF option you could potentially break existing pipelines
Phil Ewels
@ewels
Nov 07 2016 15:50
yeah, that's a good point
Stian Soiland-Reyes
@stain
Nov 07 2016 15:50
exactly.. not very sustainable. I like the split.
Denis Moreno
@Galithil
Nov 07 2016 15:50
or name them nextflow.profile
Paolo Di Tommaso
@pditommaso
Nov 07 2016 15:50
what do you mean?
Denis Moreno
@Galithil
Nov 07 2016 15:51
if you prefix all of the core nextflow options with nf- there's a reduced risk of collision with user pipelines
it makes for longer options though.
also, that's a compatibility break right there
Stian Soiland-Reyes
@stain
Nov 07 2016 15:52
you can let -profile work the same as --nf-profile, no compatibility problem unless someone goes backwards
Paolo Di Tommaso
@pditommaso
Nov 07 2016 15:52
nextflow run foo --nf-profile bar --opt1
it could be an option
but I think there's consensus that this isn't an urgent issue
Denis Moreno
@Galithil
Nov 07 2016 16:11
I don't really mind as long as I'm allowed to lash out against it on slack.
Phil Ewels
@ewels
Nov 07 2016 16:25
@Galithil just managed to run our full pipeline on a test dataset using docker for the first time :tada:
Thanks for the help @pditommaso !
Paolo Di Tommaso
@pditommaso
Nov 07 2016 16:30
great!
looking forward to running it on my laptop
:)
do you have Docker on your cluster?
Phil Ewels
@ewels
Nov 07 2016 16:32
hah, fingers crossed! We haven't pushed the image or the workflow anywhere yet, so it's not available yet
No, we can't run Docker on our main cluster, but we have a small test cloud-cluster environment that's been set up, which we'll try testing on
Paolo Di Tommaso
@pditommaso
Nov 07 2016 16:32
:+1:
Phil Ewels
@ewels
Nov 07 2016 16:32
Max 16GB memory there though I think :\
Might also see if we can free up the bosses' credit cards for a proof of principle AWS test.
Paolo Di Tommaso
@pditommaso
Nov 07 2016 16:33
:)
anyhow, we are testing Singularity, an HPC-friendly container engine
you should give it a try if you want to run containers on your cluster
Phil Ewels
@ewels
Nov 07 2016 16:35
Nice! Unfortunately we don't have direct control over our cluster, it's a national resource administered by a separate facility which makes this kind of thing pretty difficult
I'll keep hold of the link though, it could certainly come in handy in future discussions..
Paolo Di Tommaso
@pditommaso
Nov 07 2016 16:35
same here for the BSC, but when we asked them to install it they had no objection
it's not as invasive as Docker
Phil Ewels
@ewels
Nov 07 2016 16:37
ok nice! we're with UPPMAX.. Does Singularity require sudo access and other non-shared-environment stuff?
Paolo Di Tommaso
@pditommaso
Nov 07 2016 16:37
nope
the only requirement is that the binary is owned by root, but the user does not need any special permissions
Denis Moreno
@Galithil
Nov 07 2016 16:39
I ran it on my laptop
that's the most reliable test machine we have nowadays
(I was talking about NGI-RNAseq, not Singularity)
Paolo Di Tommaso
@pditommaso
Nov 07 2016 16:39
do you mean Docker?
ok
Denis Moreno
@Galithil
Nov 07 2016 16:39
yep
Paolo Di Tommaso
@pditommaso
Nov 07 2016 16:39
cool
next step: set up a CI server for it :)
Denis Moreno
@Galithil
Nov 07 2016 16:41
are there a lot of CI servers that give out 16 GB of RAM for free?
Paolo Di Tommaso
@pditommaso
Nov 07 2016 16:42
I use circleci.com, but the limit is 4GB
I think you only need 16GB for real data though; my suggestion is to create a small test dataset
Phil Ewels
@ewels
Nov 07 2016 16:43
Yeah we did that already, test dataset with a reference just for chr22
Hopefully will fit onto 4GB
(STAR aligner is hungry for memory)
Paolo Di Tommaso
@pditommaso
Nov 07 2016 16:43
I see
Phil Ewels
@ewels
Nov 07 2016 16:46
We'll get it to work anyway :)
Drafting an e-mail about Singularity now.. ;)
Paolo Di Tommaso
@pditommaso
Nov 07 2016 16:49
cool
Jason Byars
@jbyars
Nov 07 2016 20:56
If I am using the storeDir option with a process, is there anywhere in the logs that will show which missing file triggered a job resubmission when I rerun the pipeline? I have one process in a pipeline resubmitting jobs that already appear to have results in the storeDir folder. It doesn't resubmit all the jobs, just a handful, and I need a clue why.
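For reference, the process is declared roughly like this (the names are made up):

process align {
    // outputs are kept here permanently; on a rerun the task should be skipped
    // only when every declared output file already exists in this folder
    storeDir "${params.outdir}/aligned"

    input:
    file reads from reads_ch

    output:
    file "${reads.baseName}.bam" into bam_ch

    """
    align_tool ${reads} > ${reads.baseName}.bam
    """
}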
Félix C. Morency
@fmorency
Nov 07 2016 21:21
we can't use ${task.process} in output:?
it seems to break -resume
Paolo Di Tommaso
@pditommaso
Nov 07 2016 21:54
@jbyars unfortunately not, I think you can get some extra info by turning on tracing mode
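e.g. (assuming tracing mode here means the execution trace report):

nextflow run main.nf -with-trace

which writes a trace file with one row per task, including its hash and status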
@fmorency task.xxx variables are only available in the process script block
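i.e. something like this is fine (hypothetical process), but not a ${task.process} in the output: declaration:

process foo {
    output:
    file 'result.txt' into results_ch   // use a literal name here

    script:                             // task.* variables are fine inside the script block
    """
    echo "running ${task.process}" > result.txt
    """
}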
Félix C. Morency
@fmorency
Nov 07 2016 21:56
well... it works, just not with -resume
Paolo Di Tommaso
@pditommaso
Nov 07 2016 21:58
but it's not supposed to work ..
you can use it at your own risk :/
Félix C. Morency
@fmorency
Nov 07 2016 22:00
will fix
Jason Byars
@jbyars
Nov 07 2016 22:54
@pditommaso @fmorency thanks for the idea - the bug is resolved. As best I can tell, the issue was that the controller node was missing a local copy of one of the reference files for a process, while the copy existed on the workers. For some of the jobs this didn't seem to matter; for others it generated an "unable to hash file x" error. I will see if I can make an example of this later.