These are chat archives for nextflow-io/nextflow

18th Mar 2016
Tiffany Delhomme
@tdelhomme
Mar 18 2016 08:50
thanks for your pull request on bam_realignment-nf
do I need the same form for the cpus declaration in a process? i.e. cpus params.cpu, or with the assignment operator?
Paolo Di Tommaso
@pditommaso
Mar 18 2016 08:51
yes, I'm sending one more pull request for that
Tiffany Delhomme
@tdelhomme
Mar 18 2016 08:51
ok! thanks!
Paolo Di Tommaso
@pditommaso
Mar 18 2016 08:53
done
this is better because in this way nextflow can request the cpus needed when submitting the job(s) to the cluster
Tiffany Delhomme
@tdelhomme
Mar 18 2016 08:57
so if I use params.cpu it can not?
Paolo Di Tommaso
@pditommaso
Mar 18 2016 08:57
no
Tiffany Delhomme
@tdelhomme
Mar 18 2016 08:59
but what is the difference for nextflow between params.cpu and task.cpus?
Paolo Di Tommaso
@pditommaso
Mar 18 2016 08:59
to allocate resources you need to use process directives like cpus, memory, etc
the task.cpus value is defined by nextflow from the information you provide with the cpus directive
in practice the important thing is to use the resource declarations
then in the process script you could also use params.cpu like before
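A minimal sketch of the distinction (process and channel names hypothetical): the cpus directive is what nextflow uses to request resources from the executor, while task.cpus simply echoes that value back inside the script.

```groovy
// hypothetical process: the cpus directive drives the cluster request,
// and task.cpus exposes the same value inside the script block
process align {
    cpus 4

    input:
    file bam from bam_ch

    script:
    """
    samtools sort -@ ${task.cpus} $bam
    """
}
```

With a plain params.cpu value in the script, the scheduler would never learn how many CPUs the job actually needs.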
Paolo Di Tommaso
@pditommaso
Mar 18 2016 09:04
but in most cases different processes need a different number of cpus
a single cpu parameter cannot handle that use case
Tiffany Delhomme
@tdelhomme
Mar 18 2016 09:05
hum ok I understand
and so, in the process shell part, task.cpus is equivalent to params.cpu, just as params.mem is to task.memory?
Paolo Di Tommaso
@pditommaso
Mar 18 2016 09:05
instead you can specify the memory (directive) for each (or some) processes in your pipeline in the config file
This message was deleted
in the process shell part, task.cpus is equivalent to params.cpu as well as params.mem to task.memory?
yes, with the limitation said above
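A hedged sketch of that config-file approach (process names hypothetical): per-process resources can be declared in nextflow.config using the $processName selector, keeping the pipeline script itself free of hard-coded values.

```groovy
// nextflow.config -- hypothetical process names
process {
    $realign {
        cpus = 8
        memory = '16 GB'
    }
    $callVariants {
        cpus = 2
        memory = '4 GB'
    }
}
```

Processes without an entry simply fall back to the defaults.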
Tiffany Delhomme
@tdelhomme
Mar 18 2016 09:06
oh, actually I need to use task.parameter if the parameter varies between processes, don't I?
I will correct GVCF_pipeline-nf in this way
Paolo Di Tommaso
@pditommaso
Mar 18 2016 09:22
No, I would suggest the approach described here
Jason Byars
@jbyars
Mar 18 2016 20:40
I'm seeing some odd behavior with Channel.fromPath(). Here is the scenario: /data is shared between all nodes. The workflow is launched from /data/jenkins/workflow. The data to operate on is in /data/users/foo/article/SRPsomething. It has the usual SRA hierarchy project/run/.sra file. So I do targets = Channel.fromPath('/data/users/foo/article/SRPprojectnumber/**.sra'). This gives me the correct list of input files. However, when I call target.toAbsolutePath() on any of the returned Path objects, I get /data/jenkins/workflow/file.sra instead of /data/users/foo/article/SRPprojectnumber/SRRrunnumber/file.sra. Any idea how to get the correct path from the Path objects?
Paolo Di Tommaso
@pditommaso
Mar 18 2016 20:43
wait, target in your example is a channel, you cannot apply the toAbsolutePath method
try the following:
Channel
  .fromPath('/data/users/foo/article/SRPprojectnumber/**.sra')
  .println()
what does it print?
Jason Byars
@jbyars
Mar 18 2016 22:24
Sorry, I should clarify. targets is the channel, then in the following process I declare file x from targets in the inputs. So it is actually x.toAbsolutePath(). When I try your channel suggestion it outputs the correct path for all files.
Paolo Di Tommaso
@pditommaso
Mar 18 2016 22:25
I'd need to see your fragment of code
ok, I think I've understood
you should not try to resolve absolute paths in the process context
Are you doing something like this?
nextflow-io/nextflow#9
Jason Byars
@jbyars
Mar 18 2016 22:28

import org.apache.commons.io.FilenameUtils

params.wd = "/data/users/kbrayer/charliesarticle"

targets = Channel.fromPath( "${params.wd}/SRP067524/**.sra" )
Channel
    .fromPath( "${params.wd}/SRP067524/**.sra" )
    .println()

process fastqdump {
    executor 'local'

    input:
    file x from targets

    when:
    !file(FilenameUtils.removeExtension("${params.wd}/fastq/$x") + ".fq.gz").exists()

    output:
    stdout result

    script:
    out = FilenameUtils.removeExtension("${params.wd}/$x") + ".fq.gz"
    sra = x.toAbsolutePath()
    """
    umask 022
    echo "$sra $x"
    """
}

so yes, something quite similar
I'm fine with relative folders, but the part I find strange is the subfolders for the files are missing. Let me try something real quick
Paolo Di Tommaso
@pditommaso
Mar 18 2016 22:34
The point is that x refers to the file name used to stage the source file
Jason Byars
@jbyars
Mar 18 2016 22:35
yes, so for file x how do I get the rest of the relative path?
Paolo Di Tommaso
@pditommaso
Mar 18 2016 22:35
using the toAbsolutePath() method will resolve it against the application's current directory, so it makes no sense
the point is that the framework is designed to avoid the use of absolute paths, to make processes portable across different platforms
in practice the x file refers to a symlink to the original file
Jason Byars
@jbyars
Mar 18 2016 22:39
ok, I think my mistake is treating it as a file. I probably should be using val x in targets instead.
Paolo Di Tommaso
@pditommaso
Mar 18 2016 22:40
that could be a workaround
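The workaround would look roughly like this (paths as in the snippet above): declaring the input as a val means nextflow does not stage the file, so the original location is preserved, at the cost of the portability the staging mechanism provides.

```groovy
// sketch of the val workaround: the channel still emits Path objects,
// but a val input skips the staging/symlink step entirely
targets = Channel.fromPath("${params.wd}/SRP067524/**.sra")

process fastqdump {
    input:
    val x from targets

    script:
    """
    echo "processing ${x}"
    """
}
```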
Jason Byars
@jbyars
Mar 18 2016 22:40
yes, but I would like to understand how you intended it to work.
let's say we have the folder of input files symlinked under the current working directory. I'm going to need the relative paths to those input files in the script section of the process.
I'm also going to want to do a little manipulation to look for output files in the when: section
Paolo Di Tommaso
@pditommaso
Mar 18 2016 22:43
The intended approach is that nextflow stages the input files in the process working directory for you
so you don't have to care about the absolute file location, just use the relative file name given by the framework
if you need to reference more than one file, you will need to declare them as inputs in the process
Jason Byars
@jbyars
Mar 18 2016 22:45
ok, I'm fine with doing that.
Paolo Di Tommaso
@pditommaso
Mar 18 2016 22:45
Consider the classic case of paired read files
Jason Byars
@jbyars
Mar 18 2016 22:46
oh, I get how the symlinking is working now. There is no relative path because the symlinks are created in the individual job folders.
Paolo Di Tommaso
@pditommaso
Mar 18 2016 22:46
You can provide both files as a pair (in the example include also the pair id as first element of the tuple)
There is no relative path because the symlinks are created in the individual job folders.
yes
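The pair-as-tuple idea can be sketched like this (DSL1 style; process, tool, and file names hypothetical): each channel item carries the pair id plus both read files, and nextflow stages both into the task directory.

```groovy
// each channel item is a tuple: (pair id, read 1, read 2)
read_pairs = Channel.from(
    ['sampleA', file('sampleA_1.fq.gz'), file('sampleA_2.fq.gz')]
)

process trim {
    input:
    set pair_id, file(r1), file(r2) from read_pairs

    script:
    """
    trim_tool --in1 $r1 --in2 $r2 --out ${pair_id}_trimmed
    """
}
```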
Jason Byars
@jbyars
Mar 18 2016 22:49
Yes, that example is appropriate. I think I can solve for the output file names in the Channel creation.
Paolo Di Tommaso
@pditommaso
Mar 18 2016 22:49
Exactly, that is the best approach in the nextflow model
You can also filter out the non-existing ones, without having to use the when statement
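A hedged sketch of that filtering idea (paths as in the earlier snippet): drop inputs whose expected output already exists before they ever reach the process, so no when: block is needed.

```groovy
// keep only .sra files whose .fq.gz output does not exist yet
targets = Channel
    .fromPath("${params.wd}/SRP067524/**.sra")
    .filter { sra ->
        !file("${params.wd}/fastq/${sra.baseName}.fq.gz").exists()
    }
```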
Jason Byars
@jbyars
Mar 18 2016 22:50
Hmm, that would be nice, but I'm envisioning having additional Groovy functions that do a little more than check whether a file exists for the when: section
I find a lot of the tools have a nasty habit of touching the output file before crashing. The output file technically exists, but it is garbage.
So is there anything I can answer for you about Jenkins or cfncluster today?
Paolo Di Tommaso
@pditommaso
Mar 18 2016 22:55
I had a quick look at the documentation and it looks like it manages AWS autoscaling properly
what exactly is your problem using it with nxf?
Jason Byars
@jbyars
Mar 18 2016 22:56
nothing so far. I'm just working out the details.
Paolo Di Tommaso
@pditommaso
Mar 18 2016 22:56
ah, ok
Jason Byars
@jbyars
Mar 18 2016 22:58
I'm clear on what I can do with Jenkins and cfncluster. I just wanted to see if it made more sense to move some of the management tasks to NXF.
the Jenkins instance is minimal, so if I want to ask questions where I actually have to poke around in large files, that can be a problem.
it's kind of like the limitations you would run into using AWS Lambda. At first you think it would be nice to have rules on buckets to index and hash any BAM files that are uploaded, but you quickly realize the most you can really do is queue work.
Jason Byars
@jbyars
Mar 18 2016 23:07
NXF is the missing piece. I can have QC processes early and late in the NXF pipeline. I can send any status info I want to DynamoDB tables, etc. Jenkins will occasionally spin up a small cluster for housekeeping that I didn't need, but I can live with that.
Paolo Di Tommaso
@pditommaso
Mar 18 2016 23:08
I agree, NXF gives a lot more flexibility because it's not just a workflow engine
It allows you to mix external commands in a dataflow oriented programming environment
Jason Byars
@jbyars
Mar 18 2016 23:11
Right, and I deal with a lot of biologists who want to learn to analyze their data. Their scripting skills are minimal, so I need a DSL that can describe the pipelines in a way they can understand.
anything that makes the script section easier to read is a win
Paolo Di Tommaso
@pditommaso
Mar 18 2016 23:12
that's exactly the main reason for which we started the project
easy to prototype and easy to read
Jason Byars
@jbyars
Mar 18 2016 23:14
yes, that has been my experience so far.
Paolo Di Tommaso
@pditommaso
Mar 18 2016 23:14
nice
Jason Byars
@jbyars
Mar 18 2016 23:34
Have a nice weekend. I'm going to see if I can get a generic SRA project processing workflow together.
Paolo Di Tommaso
@pditommaso
Mar 18 2016 23:35
Same to you