These are chat archives for nextflow-io/nextflow

14th
Dec 2016
amacbride
@amacbride
Dec 14 2016 00:24

I have a question of long-standing that I'm just getting around to asking: I have a pipeline where I need to keep certain intermediate results as well as the final outputs. I'm currently using storeDir, and then manually copying things once the whole pipeline is finished.

Is it possible to use both storeDir and publishDir for the same process?

(And I'm assuming the NF process won't exit until all the asynchronous copies are finished?)
Paolo Di Tommaso
@pditommaso
Dec 14 2016 00:28
Yes, it should work
Shellfishgene
@Shellfishgene
Dec 14 2016 08:36
Hi! We are using an annoying version of PBS: The syntax is slightly different from PBSPro or Torque, and it does not support DRMAA. How hard is it to add support to Nextflow? For Bpipe I have just changed the adaptor bash script they use.
Paolo Di Tommaso
@pditommaso
Dec 14 2016 08:40
Nextflow does not provide a generic adapter (though it could be a nice extension)
Lukas Jelonek
@lukasjelonek
Dec 14 2016 08:41
Did something change with the structure of the params map?
if (params.db) { println "db is set to ${params.db}" }
leads to
ERROR ~ No such variable: params.db
Paolo Di Tommaso
@pditommaso
Dec 14 2016 08:42
yes, params need to be initialised before their usage
Lukas Jelonek
@lukasjelonek
Dec 14 2016 08:42
I can use the containsKey() method, but it would be tedious to change it everywhere
Paolo Di Tommaso
@pditommaso
Dec 14 2016 08:43
you need to provide a default value at the top of your script
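A minimal sketch of that pattern, assuming `false` is an acceptable default for `params.db` (the default value is an assumption, not something stated in the conversation). A value passed on the command line as `--db <value>` overrides the default:

```groovy
// Default declared at the top of the script (assumed value: false).
// Running with `--db <value>` replaces it.
params.db = false

// The key now always exists, so the check no longer fails
// with "No such variable: params.db".
if (params.db) {
    println "db is set to ${params.db}"
}
```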
Lukas Jelonek
@lukasjelonek
Dec 14 2016 08:43
okay, I'll try that
Thanks
Paolo Di Tommaso
@pditommaso
Dec 14 2016 08:43
welcome
@Shellfishgene however, in principle a custom executor is quite easy to implement; the PBS one is little more than 100 lines of code
how does your PBS version differ?
Shellfishgene
@Shellfishgene
Dec 14 2016 08:45
Yes, I'm just looking at that one.
There are just some differently named options for qstat and qsub and so on. However, the biggest difference is that users can't get job information from qstat after the job is finished or canceled.
Paolo Di Tommaso
@pditommaso
Dec 14 2016 08:47
However the biggest difference is that users can't get job information from qstat after the job is finished or canceled.
Shellfishgene
@Shellfishgene
Dec 14 2016 08:47
For bpipe I had to add "&& touch $jobno.txt " to all commands in batch files to find out if the job was finished successfully...
Paolo Di Tommaso
@pditommaso
Dec 14 2016 08:47
this is not a big problem, most batch schedulers do not provide this feature
Shellfishgene
@Shellfishgene
Dec 14 2016 08:48
ok
how does nextflow find out if a job was cancelled because of time, or command error, or finished?
Paolo Di Tommaso
@pditommaso
Dec 14 2016 08:48
is the job submission command always qsub?
Shellfishgene
@Shellfishgene
Dec 14 2016 08:49
yes, it's qsub
Paolo Di Tommaso
@pditommaso
Dec 14 2016 08:49
NF polls for a file created when the job finishes
Shellfishgene
@Shellfishgene
Dec 14 2016 08:52
So does the PBS executor only work with command-line options to qsub or does it write a script to submit? Sorry, still having a hard time reading Groovy...
Paolo Di Tommaso
@pditommaso
Dec 14 2016 08:54
For each job the executor creates the user command script and a launcher script that is used to submit it
Is there any documentation about your PBS installation?
If it's publicly available I can consider adding it to the executors supported by NF
Shellfishgene
@Shellfishgene
Dec 14 2016 09:01
No, our PBS seems very little used, the first hits in Google are for our university. It's NQSII, used by NEC on their clusters.
If it's only minor differences in the options, I can maybe do it myself. Another detail that's different I just noticed is the output from qstat -f, the parsing needs to be changed a little.
Paolo Di Tommaso
@pditommaso
Dec 14 2016 09:05
PBS and SGE are very similar
I would check whether the output of your installation matches that of SGE
for example
Shellfishgene
@Shellfishgene
Dec 14 2016 09:10
It's just similar, not the same. For example the state strings are different, ours uses "QUE" or "RUN" instead of single letters. Also the job IDs come with something attached that needs to be removed ("123432.something"), and so on.
Paolo Di Tommaso
@pditommaso
Dec 14 2016 09:11
!
Shellfishgene
@Shellfishgene
Dec 14 2016 09:11
hmm?
Paolo Di Tommaso
@pditommaso
Dec 14 2016 09:12
just surprised .. :)
Shellfishgene
@Shellfishgene
Dec 14 2016 09:12
Yeah, they really need to add DRMAA support...
Paolo Di Tommaso
@pditommaso
Dec 14 2016 09:21
yes, very similar
if you send me the exact command lines and their output I can arrange an executor for that
Shellfishgene
@Shellfishgene
Dec 14 2016 09:25
That would be great. I'll look at NF some more, and if I decide to use it I will send you that info.
Paolo Di Tommaso
@pditommaso
Dec 14 2016 09:27
Sure, open a feature request on GH if you want: https://github.com/nextflow-io/nextflow/issues
Phil Ewels
@ewels
Dec 14 2016 10:56
Hi @pditommaso - I e-mailed Amazon the other day about the possibility of hosting reference genomes as part of their AWS Public Datasets initiative (figured it would be nice to not have to download / build the ref every time we do a run on AWS). They replied really quickly anyway, and I just sent back a questionnaire about it.
I felt a bit out of my depth as I'm not that comfortable with AWS stuff. But hopefully I did an ok job!
Anyway, I mentioned Nextflow as being one tool that supports Docker + AWS natively, hope that's ok
More info about the datasets here: https://aws.amazon.com/public-datasets/
Happy to cc you into the thread if you're interested / would like to contribute
Paolo Di Tommaso
@pditommaso
Dec 14 2016 13:04
@ewels very well done, I would be happy to contribute/provide help if needed
Shellfishgene
@Shellfishgene
Dec 14 2016 15:54
Beginner question: I want a simple "fastqc" process. Input is the fastq file, "foobar.fastq.gz". fastqc produces 2 output files, "foobar.fastqc.html" and "foobar.fastqc.zip". How do I name these in the output section? I basically have to remove the ".fastq.gz" from the input file and add the fastqc.html part.
Paolo Di Tommaso
@pditommaso
Dec 14 2016 15:56
either
process foobar {
   output: 
   file 'foobar.fastqc.html' into ch1
   file 'foobar.fastqc.zip'  into ch2
:
}
or
process foobar {
   output: 
   file 'foobar.fastqc.*' into ch1
:
}
Shellfishgene
@Shellfishgene
Dec 14 2016 15:59
sorry, it's of course not always "foobar", I want to run it on a range of files.
Paolo Di Tommaso
@pditommaso
Dec 14 2016 15:59
ok, if so you can capture by the extension
process foobar {
   output: 
   file '*.fastqc.{html,zip}' into ch1
:
}
but I guess you are looking for something like this
Shellfishgene
@Shellfishgene
Dec 14 2016 16:02
yes, that's it. In your previous example, NF would not be able to distinguish output files from many parallel fastqc runs, right?
Paolo Di Tommaso
@pditommaso
Dec 14 2016 16:02
of course
each task runs in its own workdir, so there is no problem with parallel tasks
you can even always use the same file name
Shellfishgene
@Shellfishgene
Dec 14 2016 16:04
Ah, right. Have to get used to how this works :)
Paolo Di Tommaso
@pditommaso
Dec 14 2016 16:06
yes, it has a different approach compared to other tools
Shellfishgene
@Shellfishgene
Dec 14 2016 16:21
So this is what I have so far, the fastqc zip contains a folder with the basename of the original fastq again. How would I get that in this unzip step? http://pastebin.com/FnAbK5aN
Paolo Di Tommaso
@pditommaso
Dec 14 2016 16:23
what do you mean by
unzip $fastqc_zip ???_fastqc/Images/per_base_quality.png
is ???_fastqc/Images/per_base_quality.png the expected path of the unzipped file?
Shellfishgene
@Shellfishgene
Dec 14 2016 16:26
no, it's the path inside the zip file. I want to extract that file, but the ??? need to be replaced by the sample name (basename of fastq file)
Paolo Di Tommaso
@pditommaso
Dec 14 2016 16:27
I see
you can use glob patterns, i.e. wildcards, to capture outputs
for example
process unzip {
   input:
   file fastqc_zip from ch1

   output:
   file '*_fastqc/Images/per_base_quality.png'

   """
   unzip $fastqc_zip ???_fastqc/Images/per_base_quality.png
   """
}
or you could just move that file into the base path ..
process unzip {
   input:
   file fastqc_zip from ch1

   output:
   file 'per_base_quality.png'

   """
   unzip $fastqc_zip ???_fastqc/Images/per_base_quality.png
   mv ???_fastqc/Images/per_base_quality.png .
   """
}
Shellfishgene
@Shellfishgene
Dec 14 2016 16:30
The output of the unzip command will only be the png file, the Images folder only exists in the zip file
Paolo Di Tommaso
@pditommaso
Dec 14 2016 16:30
the result of the unzip is going to be a file, right?
Shellfishgene
@Shellfishgene
Dec 14 2016 16:30
yes, the problem is how to replace the ??? with the sample name....
yes, the png file
Paolo Di Tommaso
@pditommaso
Dec 14 2016 16:31
what is the expected name of the unzipped file?
Shellfishgene
@Shellfishgene
Dec 14 2016 16:32
it's correct as above, the per_base_quality.png. But the unzip tool needs to know the folder name inside the zip file to extract it.
Paolo Di Tommaso
@pditommaso
Dec 14 2016 16:32
ahhh
Shellfishgene
@Shellfishgene
Dec 14 2016 16:32
unzip 110_S110_L001_R1_001_fastqc.zip 110_S110_L001_R1_001_fastqc/Images/per_base_quality.png
this is the actual command that's supposed to run for that example input file
Paolo Di Tommaso
@pditommaso
Dec 14 2016 16:34
???_fastqc is supposed to be the same as the zip file name, without the extension, right?
Shellfishgene
@Shellfishgene
Dec 14 2016 16:34
yes
I read the unique ID FAQ question you linked, but don't know how to apply it here.
Paolo Di Tommaso
@pditommaso
Dec 14 2016 16:34
process unzip {
   input:
   file fastqc_zip from ch1

   output:
   file 'per_base_quality.png'

   """
   unzip $fastqc_zip ${fastqc_zip.baseName}/Images/per_base_quality.png
   mv ${fastqc_zip.baseName}/Images/per_base_quality.png .
   """
}
Shellfishgene
@Shellfishgene
Dec 14 2016 16:35
heh, that's pretty simple :)
Paolo Di Tommaso
@pditommaso
Dec 14 2016 16:35
fastqc_zip is a Path object
it is ! :)
Shellfishgene
@Shellfishgene
Dec 14 2016 16:36
so baseName is a Groovy function?
Paolo Di Tommaso
@pditommaso
Dec 14 2016 16:37
Actually it's a NF extension
Shellfishgene
@Shellfishgene
Dec 14 2016 16:37
No such property: baseName for class: nextflow.util.BlankSeparatedList
Hmm...
Paolo Di Tommaso
@pditommaso
Dec 14 2016 16:37
ah
because you are capturing both html and the zip in the same channel
Shellfishgene
@Shellfishgene
Dec 14 2016 16:38
Ok, meant to change that actually, don't need html.
Paolo Di Tommaso
@pditommaso
Dec 14 2016 16:39
you have to have two different outputs in the previous process
so that you can then handle them separately
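A minimal sketch of the previous process with the two outputs declared separately, assuming the FastQC files are named like the `110_S110_L001_R1_001_fastqc.zip` example above. The process name `fastqc`, the input path `data/*.fastq.gz`, and the channel names `reads`, `fastqc_html` and `fastqc_zip` are placeholders, not taken from the original pipeline:

```groovy
// Hypothetical input location; adjust to where the fastq files actually live.
reads = Channel.fromPath('data/*.fastq.gz')

process fastqc {
   input:
   file fastq from reads

   output:
   // One channel per file type: each emitted item is then a single file,
   // so properties such as baseName are available downstream.
   file '*_fastqc.html' into fastqc_html
   file '*_fastqc.zip' into fastqc_zip

   """
   fastqc $fastq
   """
}
```

The unzip process would then read `from fastqc_zip` instead of a channel that mixes the html and zip files.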
Shellfishgene
@Shellfishgene
Dec 14 2016 16:40
Ok, only captured the zip and now it works, thanks!
Paolo Di Tommaso
@pditommaso
Dec 14 2016 16:40
:+1:
Shellfishgene
@Shellfishgene
Dec 14 2016 16:40
One last question: I now changed the second process a lot and reran the pipeline, but it always also reran the first fastqc process. Is that normal?
Paolo Di Tommaso
@pditommaso
Dec 14 2016 16:41
yes, if you don't want to re-execute it add -resume to the NF command line
Shellfishgene
@Shellfishgene
Dec 14 2016 16:42
Ok, should have read that first. Thanks and good night!
Paolo Di Tommaso
@pditommaso
Dec 14 2016 16:42
welcome !
amacbride
@amacbride
Dec 14 2016 23:58
If storeDir or publishDir is an S3 URL, will NF handle it the same way it does for input?