These are chat archives for nextflow-io/nextflow

8th
Jun 2016
also @pditommaso you wanted to see the flowchart (very basic VC pipeline, not even any filtering!):
https://raw.githubusercontent.com/bionode/gsoc16/716b9ce1d3458318b1e5dc9aa84d649926706131/pipelines/with-nextflow/flowchart.png
Rickard Hammarén
@Hammarn
Jun 08 2016 07:00
@pditommaso The new snapshot seems to have fixed the problem! :+1:
Paolo Di Tommaso
@pditommaso
Jun 08 2016 07:24
great
Hugues Fontenelle
@huguesfontenelle
Jun 08 2016 10:38

Hi! Trying to use:

if (NXF_DEBUG != null) {println "Debug active"}

but it throws Unknown variable 'NXF_DEBUG'
which was exactly what I was trying to check for.

Paolo Di Tommaso
@pditommaso
Jun 08 2016 10:39
indeed, it is not defined by default
Hugues Fontenelle
@huguesfontenelle
Jun 08 2016 10:40
well yes, but am I not supposed to be able to check if variables exist in Groovy?
(alternatively my question becomes: how do I check..? :-)
Paolo Di Tommaso
@pditommaso
Jun 08 2016 10:42
Nextflow allows you to access an env variable, but it must exist, otherwise you will get that exception
let me check for a workaround
Hugues Fontenelle
@huguesfontenelle
Jun 08 2016 10:44

Tried

if (typeof(NXF_DEBUG) != 'undefined') {}

but same..

Paolo Di Tommaso
@pditommaso
Jun 08 2016 10:50
You can create a helper method like this
def env(String name) {
  System.getenv().containsKey(name) ? System.getenv(name) : null 
}

println env('PATH')
don't forget you have full access to the underlying Java (and Groovy) API
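A minimal sketch of how that helper could replace the original NXF_DEBUG check, assuming it is pasted at the top of the pipeline script. The second form relies on the fact that the Java call System.getenv(String) itself returns null for an undefined variable; the name `debug` is only illustrative.

```groovy
// Helper from the message above: returns the variable's value, or null if it is not set.
def env(String name) {
    System.getenv().containsKey(name) ? System.getenv(name) : null
}

// The original check, rewritten so that an undefined NXF_DEBUG no longer throws.
if (env('NXF_DEBUG') != null) {
    println "Debug active"
}

// System.getenv(String) already returns null for an undefined variable,
// so the same test can be written without the helper.
def debug = System.getenv('NXF_DEBUG') != null
println "Debug flag set: ${debug}"
```

Either form avoids the "Unknown variable" error, because the name is passed as a string instead of being resolved as a script variable.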
Hugues Fontenelle
@huguesfontenelle
Jun 08 2016 10:53
looks good, thanks!
BTW the error checking is much improved, telling me where in the code something goes wrong. Also duplicate channels. Good job.
Hugues Fontenelle
@huguesfontenelle
Jun 08 2016 11:26

Sometimes I want to publish files, yet I do not need them in any subsequent process. Can I:

output:
file("*.vcf")

without specifying a channel?

(I used to re-use a "dummy" channel, but reusing seems to be forbidden now)
Paolo Di Tommaso
@pditommaso
Jun 08 2016 11:33
Yes, you can do that
Paolo Di Tommaso
@pditommaso
Jun 08 2016 11:41
BTW the error checking is much improved, telling me where in the code something goes wrong. Also duplicate channels. Good job.
Thanks :)
Paolo Di Tommaso
@pditommaso
Jun 08 2016 13:02
A question for the community: I'm improving publishDir by adding an extra parameter that lets you choose the target file name dynamically, i.e. individually for each published file, by using a closure.
This closure takes the file name as its argument and returns either a path relative to the publish dir or an absolute path
for example:
Paolo Di Tommaso
@pditommaso
Jun 08 2016 13:08
process foo {
  publishDir '/target/path', xxx: { fileName -> fileName=='something' ? 'newFileName' : '/some/other/fileName'  } 

  '''
  your script
  '''
}
Johan Viklund
@viklund
Jun 08 2016 13:09
if the path ends with a / will it put the file into that without renaming?
Paolo Di Tommaso
@pditommaso
Jun 08 2016 13:10
I would say no.
The question is, what do you think is the best name for this new option:
Johan Viklund
@viklund
Jun 08 2016 13:10
so no /fq$/ ? "fastqdir/" : "otherdir/"?
Paolo Di Tommaso
@pditommaso
Jun 08 2016 13:10
1) saveTo
2) saveAs
3) other .. propose a different one
Johan Viklund
@viklund
Jun 08 2016 13:13
if it is supposed to specify the whole path including the final file, saveAs; if it is possible to specify a directory as target, saveTo.
Paolo Di Tommaso
@pditommaso
Jun 08 2016 13:14
I was thinking the same, the main use case is to provide a mechanism to rename published files
Johan Viklund
@viklund
Jun 08 2016 13:15
got to go
Samuel Lampa
@samuell
Jun 08 2016 13:15
Have been pondering this too (anonymous function for generating file names/paths) ... I found it hard, not sure I came up with any good names ... went with something like "outpathformatter", ugh ...
Paolo Di Tommaso
@pditommaso
Jun 08 2016 13:15
but at the same time to provide the ability to switch to different target paths if required
@samuell yes, it's very difficult to find intuitive, not repetitive and short names
Samuel Lampa
@samuell
Jun 08 2016 13:16
indeed
saveAs doesn't seem that bad though
similar to how it is named in GUI applications :)
Paolo Di Tommaso
@pditommaso
Jun 08 2016 13:17
yep :)
Paolo Di Tommaso
@pditommaso
Jun 08 2016 13:23
so no /fq$/ ? "fastqdir/" : "otherdir/"?
@viklund yes, but doing
it =~ /fq$/ ? "fastqdir/$it" : "otherdir/$it"?
Johan Viklund
@viklund
Jun 08 2016 13:31
Yes, that works
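A minimal sketch of that routing closure in plain Groovy, outside of any process. The option name was still undecided at this point in the discussion, so the closure is assigned to a hypothetical variable `target`; the file names are made up for illustration.

```groovy
// Hypothetical stand-in for the closure that would be passed to the new publishDir option.
// It receives a file name and returns a path relative to the publish directory.
def target = { String name -> (name =~ /fq$/) ? "fastqdir/$name" : "otherdir/$name" }

// Made-up file names, only for illustration.
['sample1.fq', 'sample1.bam'].each { name ->
    println "${name} -> ${target(name)}"
}
```

The returned path keeps the file name after the directory, which is the difference from the "fastqdir/" form asked about earlier.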
Maxime Garcia
@MaxUlysse
Jun 08 2016 13:42
Hello,
I'm trying to use the each qualifier in a process to go through different items in a set, and I'm not really successful, can anyone give me some pointers ?

my process input is:

input:
set val(idPatient), val(idSample), file(realignedBamTable), file(realignedBaiTable) from realignedBam
each sample from idSample

Paolo Di Tommaso
@pditommaso
Jun 08 2016 13:46
it looks fine
what's the problem ?
Maxime Garcia
@MaxUlysse
Jun 08 2016 13:47
I got just one id in idPatient, several in idSample, and the same number of files
Paolo Di Tommaso
@pditommaso
Jun 08 2016 13:47
ah wait
Maxime Garcia
@MaxUlysse
Jun 08 2016 13:47
all this set was made by a groupTuple, with the identifier being the idPatient
Paolo Di Tommaso
@pditommaso
Jun 08 2016 13:48
you cannot have val(idSample) and use the same idSample in the from part of the declaration
the from must be something defined in the scope external to the process
Maxime Garcia
@MaxUlysse
Jun 08 2016 13:49
ok
It's a good beginning
Thanks a lot
Paolo Di Tommaso
@pditommaso
Jun 08 2016 13:50
yes, the meaning is that you are importing something from the external context
and defining some variables in the process context
Maxime Garcia
@MaxUlysse
Jun 08 2016 14:37
Ok, so now I have no problem getting the each qualifier working inside the process
thanks
Phil Ewels
@ewels
Jun 08 2016 14:46
Hi @pditommaso - we have a tool (MultiQC) that creates a summary report at the end of the pipeline by parsing all log files from every step of the pipeline. Currently we have it working by telling it to scan files in the publishDir directory, but this makes me nervous as your docs specifically warn against doing that ;)
Do you have any ideas for how to better do this? Collecting every file from every step as a specific output seems a bit clumsy, plus fiddly when they need to be consumed by other processes and so on.
Currently I'm inclined to stick with the publishDir approach and just hope it doesn't break too often :hushed:
Paolo Di Tommaso
@pditommaso
Jun 08 2016 15:03
Hi there, it depends on how you run MultiQC.
If you launch it at the end of the pipeline execution in a separate process obviously there's no problem
Phil Ewels
@ewels
Jun 08 2016 15:03
multiqc <directory> typically. See http://multiqc.info/ for more details
Paolo Di Tommaso
@pditommaso
Jun 08 2016 15:03
yep, but is it a step in a Nextflow pipeline?
Phil Ewels
@ewels
Jun 08 2016 15:04
Yup, exactly - currently we have it as a final process that waits for everything else to finish: https://github.com/ewels/NGI-RNAseq/blob/master/main.nf#L620-L638
Don't get me wrong - it seems to work fine :) Just the warning about asynchronous copying makes me paranoid that it could run before things have finished copying across to the results directory
Paolo Di Tommaso
@pditommaso
Jun 08 2016 15:06
I understand that :)
but in your case it's fine, because publishing is applied asynchronously only when files are copied
since you didn't specify that, they are just symlinked, and that is done in a synchronous manner
Phil Ewels
@ewels
Jun 08 2016 15:09
hah, that's on our to-do list to change ;) (change to copy, remove work directory after successful completion)
Paolo Di Tommaso
@pditommaso
Jun 08 2016 15:11
I'm wondering whether there's a more Nextflow-compliant approach ..
Phil Ewels
@ewels
Jun 08 2016 15:11
It's not the first time that this sort of thing has come up - any containerised apps struggle with providing a single directory for MultiQC to crawl
Can also provide file paths or a file full of paths to individual files instead of a directory if that helps (that's how we solved the same problem for bcbio)
Paolo Di Tommaso
@pditommaso
Jun 08 2016 15:13
the point is that the multiqc step depends on all the other steps
so formally you should declare those dependencies
Phil Ewels
@ewels
Jun 08 2016 15:14
yeah, currently we cheat again by using empty .done files from the final couple of processes (then ignoring them)
Paolo Di Tommaso
@pditommaso
Jun 08 2016 15:14
I saw, but it's a trick
Phil Ewels
@ewels
Jun 08 2016 15:14
caught red handed ;)
Paolo Di Tommaso
@pditommaso
Jun 08 2016 15:17
I would add to each process an extra channel that would be used to bring the required outputs into the multiqc step
then in that step you would have
process multiqc {
  input: 
  file ('fastqc/*') from fastqc_results 
  file ('trim_galore/*') from trim_galore_results 
  : 

'''
multiqc -f  .
'''
}
note that using a relative path in the input file declaration automatically stages those files into those subfolders
Phil Ewels
@ewels
Jun 08 2016 15:21
huh, nice - didn't know you could do that
Paolo Di Tommaso
@pditommaso
Jun 08 2016 15:21
so basically you recreate the structure expected by multiqc
without tricks !
:)
Phil Ewels
@ewels
Jun 08 2016 15:21
probably not required by multiqc actually - i don't think it will care if all files are in the wd without directories
only downside I can see is that the multiqc process can get pretty long if it's a biggish pipeline, but probably worth it to do it properly
other approach i wondered about was whether there was some magic variable pointing to working directories used in the pipeline? Then it could just list off the different work directories
e.g. :
multiqc -f work/b2/b2ksjhdbf work/ef/ef43fwejkhbw work/38/38sdkjfbdb [..etc..]
But not sure if they ever get removed after a process is complete and before the pipeline has finished?
Rickard Hammarén
@Hammarn
Jun 08 2016 15:24
Would that require declaring the output files from fastqc into the fastqc_results channel?
Paolo Di Tommaso
@pditommaso
Jun 08 2016 15:25
@ewels but this would not be portable, in a cloud won't work
Phil Ewels
@ewels
Jun 08 2016 15:25
@pditommaso: ok, I thought as much. We'll try your approach :)
Paolo Di Tommaso
@pditommaso
Jun 08 2016 15:25
ok
Phil Ewels
@ewels
Jun 08 2016 15:25
@Hammarn: can just use fastqc_zip? We're not actually using that anyway
Paolo Di Tommaso
@pditommaso
Jun 08 2016 15:26
@Hammarn was that a question for me? (I didn't get it)
Phil Ewels
@ewels
Jun 08 2016 15:26
Thanks for the help @pditommaso
Rickard Hammarén
@Hammarn
Jun 08 2016 15:26
@pditommaso Yes :P
Paolo Di Tommaso
@pditommaso
Jun 08 2016 15:26
you are welcome
ah yes
the files that you need in the multiqc step
you could add
output:
: 
    file '*_fastqc.{html,zip}' into fastqc_results
if you need both html and zip files
but Phil already said you need just fastqc_zip
Phil Ewels
@ewels
Jun 08 2016 15:29
We don't - this is MultiQC specific rather than nextflow specific anyway so we can figure this stuff out ourselves I think
@Hammarn: For reference, this is what MultiQC uses: https://github.com/ewels/MultiQC/blob/master/multiqc/utils/search_patterns.yaml
Rickard Hammarén
@Hammarn
Jun 08 2016 15:34
Ok, good =)
Another question/issue:
The snapshot you released earlier does not seem to work as well as I initially thought. I don't get the error I was getting earlier, but there are still issues with trying to load modules. I think the bioinfo-tools module is not being loaded properly, so I end up with errors like this:
 cat .command.log
Lmod has detected the following error: These module(s) exist but cannot be
loaded as requested: "picard/2.0.1"

   Try: "module spider picard/2.0.1" to see how to load the module(s).
Paolo Di Tommaso
@pditommaso
Jun 08 2016 15:38
that's bad, could you open an issue for that and try to debug that snippet (.command.env)?
Rickard Hammarén
@Hammarn
Jun 08 2016 15:38
Sure
Paolo Di Tommaso
@pditommaso
Jun 08 2016 15:38
tx
Mike Smoot
@mes5k
Jun 08 2016 15:39
@pditommaso Just wanted to provide feedback on the publishDir saveTo question. My idea is to overload the path keyword so that it either points to a string identifying the path (current behavior) or a closure that returns a path. This wouldn't address renaming an output file, but perhaps that could be a separate rename closure that allows output files to be renamed. This assumes it's possible for path to be a string OR closure...
Paolo Di Tommaso
@pditommaso
Jun 08 2016 15:42
@mes5k I will reply later, tx for the feedback
Rickard Hammarén
@Hammarn
Jun 08 2016 15:44
What do you want me to include? The .command.log and .command.env, anything else?
Paolo Di Tommaso
@pditommaso
Jun 08 2016 15:46
The best would be if you could run it to understand what's wrong
Just running bash -x .command.env
Or include this output
Paolo Di Tommaso
@pditommaso
Jun 08 2016 19:44
@mes5k The problem is that the path attribute can already be a closure, with different semantics from saveAs. So, it's not an option.
Mike Smoot
@mes5k
Jun 08 2016 19:46
Ok, didn't realize that.
Just out of curiosity can you tell me what the semantics for a closure are for the path attribute? I didn't see anything in a brief look at the docs.
Paolo Di Tommaso
@pditommaso
Jun 08 2016 19:50
because path is implicitly used for publishDir '/something'
at the same time (almost) any directive can be defined with closure, so you can write
publishDir { .. }
which lets you define the target path dynamically, depending on the process inputs
Mike Smoot
@mes5k
Jun 08 2016 19:53
Is the argument to the publishDir closure each file from the output?
Paolo Di Tommaso
@pditommaso
Jun 08 2016 19:53
but this is different from saveAs because the latter is invoked for each published file not just for each process execution
Is the argument to the publishDir closure each file from the output?
No, it doesn't receive any argument
Mike Smoot
@mes5k
Jun 08 2016 19:54
Ok, got it.
Mike Smoot
@mes5k
Jun 08 2016 20:01
What about pathAs for a path closure and fileAs for a name closure? It seems like some of the concern is that saveAs and saveTo conflate the path naming and file naming, correct?
Paolo Di Tommaso
@pditommaso
Jun 08 2016 20:04
changing existing parameters is always something to be avoided as much as possible
it's very annoying for people relying on them
for this reason I would prefer not to do that
Mike Smoot
@mes5k
Jun 08 2016 20:06
I was suggesting adding new parameters, not changing existing ones. I am very familiar with the trials and tribulations of maintaining backwards compatibility! :)
Paolo Di Tommaso
@pditommaso
Jun 08 2016 20:09
I see, I fear the difference between pathAs and fileAs is too subtle to be easily remembered
(uff ... typos!)
:)
Mike Smoot
@mes5k
Jun 08 2016 20:11
Maybe dirAs? But if you'd prefer not to have two parameters then I think either saveAs or saveTo would be fine.
Paolo Di Tommaso
@pditommaso
Jun 08 2016 20:19
I'm more inclined to have only one new param for that