These are chat archives for nextflow-io/nextflow

18th Jan 2017
Eivind Gard Lund
@eivindgl
Jan 18 2017 10:57
I am really impressed by nextflow and I am trying to convert an existing shell/python workflow into nextflow. I am currently trying to write a process that accepts a csv file, joins all the entries of a specific column and runs a shell command on said entries. Is this possible with nextflow? I saw the splitCsv operation, but I am unsure about how to proceed when the input to the process is a "file" and not just text... I would appreciate any help :)
Paolo Di Tommaso
@pditommaso
Jan 18 2017 11:12
@eivindgl Hi there, you can manage that with NF for sure.
a process that accepts a csv file, joins all the entries of a specific column and runs a shell command on said entries.
what do you mean by that?
how are you supposed to join all the entries?
Eivind Gard Lund
@eivindgl
Jan 18 2017 11:34
@pditommaso Thank you. Let's forget about csv and just assume an input file with one path per line. If the input file content is "file1\nfile2\n", then I would like my nf process to execute "cat file1 file2". How would I accomplish this?
Paolo Di Tommaso
@pditommaso
Jan 18 2017 11:52
I see, but I guess you want to execute a process run for each line/file, right?
Eivind Gard Lund
@eivindgl
Jan 18 2017 11:55
No, just once for all of the files/lines.
Paolo Di Tommaso
@pditommaso
Jan 18 2017 11:58
in this case I would manage everything in the process script
Eivind Gard Lund
@eivindgl
Jan 18 2017 11:58
This does nearly what I want. I just need to join the list into a string separated by space: Channel.fromPath('sample_meta.csv').splitCsv(header: true).map { it.sample_path }.toList()
Paolo Di Tommaso
@pditommaso
Jan 18 2017 11:59
nearly
Channel.fromPath('sample_meta.csv').splitCsv(header: true).map { file(it.sample_path) }.toList()
you need file to convert the path string to a file object
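Putting the two pieces together, a minimal DSL1-era sketch might look like the following (the process name concat and the output filename merged.txt are illustrative, assuming a sample_path column in sample_meta.csv):

```groovy
// Collect every file listed in the sample_path column into one list
sample_files = Channel
    .fromPath('sample_meta.csv')
    .splitCsv(header: true)
    .map { row -> file(row.sample_path) }
    .toList()

// Run the shell command once over all files; Nextflow stages the
// list and expands $all_files to a space-separated string of names
process concat {
    input:
    file all_files from sample_files

    output:
    file 'merged.txt'

    """
    cat $all_files > merged.txt
    """
}
```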
Fredrik Boulund
@boulund
Jan 18 2017 16:13
Hi! Quick question, trying to understand how the process directory works...
What's the best way to control the size of the process directory? I don't want to keep any unnecessary process intermediates as soon as any downstream processes have consumed what they need
Paolo Di Tommaso
@pditommaso
Jan 18 2017 16:18
Hi, temporary files are automatically deleted as long as you define process.scratch = true in the config file
Fredrik Boulund
@boulund
Jan 18 2017 16:18
sweet! then that shouldn't be a problem
Thanks for such a quick answer btw! :)
Paolo Di Tommaso
@pditommaso
Jan 18 2017 16:20
however I think you are looking for this feature
Fredrik Boulund
@boulund
Jan 18 2017 16:21

so something like this:

process {
     executor = 'slurm'
     scratch = true 
}

in my nextflow.config would do it?

Paolo Di Tommaso
@pditommaso
Jan 18 2017 16:21
yes
see the GH issue for the complete answer
Fredrik Boulund
@boulund
Jan 18 2017 16:22
I see now! yeah, it looks exactly like what I'm looking for
I also come from a Snakemake background and I liked the way it was handled there
Paolo Di Tommaso
@pditommaso
Jan 18 2017 16:22
I was guessing that :)
NF works in a different way
at the very end, you can just drop the work dir when your pipeline completes to remove all intermediate files.
Fredrik Boulund
@boulund
Jan 18 2017 16:24
yeah, that's good
I'm also concerned about the size of the work dir during execution of say hundreds of samples
Félix C. Morency
@fmorency
Jan 18 2017 16:25
Make sure you are copying the results first before dropping work :)
Fredrik Boulund
@boulund
Jan 18 2017 16:25
@fmorency, thanks! I set a publishDir for all my important files to somewhere else, that should solve that problem, right?
Thanks a lot for your help so far! I'll be back with more questions when I get stuck again! :)
Félix C. Morency
@fmorency
Jan 18 2017 16:26
@boulund Make sure to set the correct mode. By default, the publishDir will contain symlinks to work
Paolo Di Tommaso
@pditommaso
Jan 18 2017 16:27
yep, use a hardlink or copy
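As a sketch, a publishDir with an explicit mode looks like this (the process name and results path are hypothetical):

```groovy
process my_step {
    // mode: 'copy' writes real files instead of symlinks, so the
    // results survive deletion of the work dir; mode: 'link'
    // (hardlink) also works when results sit on the same filesystem
    publishDir 'results/', mode: 'copy'

    output:
    file 'out.txt'

    """
    echo done > out.txt
    """
}
```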
Fredrik Boulund
@boulund
Jan 18 2017 17:40
thanks, got it! Already set them to copy :)
Paolo Di Tommaso
@pditommaso
Jan 18 2017 17:41
:+1:
Fredrik Boulund
@boulund
Jan 18 2017 17:41
I was thinking a bit about the staging mode settings
is there a reason for wanting to copy when staging out files?
Paolo Di Tommaso
@pditommaso
Jan 18 2017 17:42
um, they are always copied when staging out files
Fredrik Boulund
@boulund
Jan 18 2017 17:43
ah, my bad! sorry
was reading about the stageInMode settings and didn't realize they were different for in and out
That makes perfect sense
Paolo Di Tommaso
@pditommaso
Jan 18 2017 17:44
it depends if you use scratch or not
Fredrik Boulund
@boulund
Jan 18 2017 17:45
I think I'll always use scratch, at least for the type of workflows I'm considering right now
Paolo Di Tommaso
@pditommaso
Jan 18 2017 17:45
when scratch = true the process is executed on the local disk, thus both stage in and out require files to be copied
Fredrik Boulund
@boulund
Jan 18 2017 17:46
ah, you can't symlink to the scratch dir?
Paolo Di Tommaso
@pditommaso
Jan 18 2017 17:46
when scratch = false input files are symlinked in the task work dir and stage out is not required
Fredrik Boulund
@boulund
Jan 18 2017 17:48
It's not a big deal if you can't. I was just thinking there's really no need for some of my processes to copy input files to the scratch dir if I only intend to do a single linear read of their contents, but I still want to stage in all the database files I need, since they're going to require a lot of random access that I don't want to do via the slow shared file system
but copying all input files to the stage dir wouldn't make a big difference anyhow -- most likely the files will still be in file system cache on the local nodes so reading them from the scratch dir will be super quick anyway.
Paolo Di Tommaso
@pditommaso
Jan 18 2017 17:49
Wait, my fault. When scratch = true input files are symlinked as well
but in some cases you may want to copy them for performance reason
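In config terms, that per-process choice can be sketched with stageInMode (the process name random_access_step is hypothetical):

```groovy
process {
    scratch = true

    // copy inputs only for this process, because it does heavy
    // random access on them; everything else keeps the default
    // symlink staging
    $random_access_step {
        stageInMode = 'copy'
    }
}
```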
Fredrik Boulund
@boulund
Jan 18 2017 17:50
Ok, so symlinking input files into the task work dir is always default?
good to know
Paolo Di Tommaso
@pditommaso
Jan 18 2017 17:50
yes
Fredrik Boulund
@boulund
Jan 18 2017 17:51
Another thing. What's best practice when working with conda environments?
Paolo Di Tommaso
@pditommaso
Jan 18 2017 17:51
good question !
Fredrik Boulund
@boulund
Jan 18 2017 17:51
should I produce one big conda environment that contains everything I need for all steps of the workflow, or should I make small ones for each subprocess that requires one, and just use beforeScript to load the relevant conda environment for each process?
Paolo Di Tommaso
@pditommaso
Jan 18 2017 17:52
so far, there's no best practice .. :)
but I'm interested in any suggestion because there's an idea to provide a built-in support for it
Fredrik Boulund
@boulund
Jan 18 2017 17:54
I haven't decided yet what I think I'd prefer
my first idea was to just make lots of small, clean, conda environments for each specific process... But then I realized that's a lot of work... Probably better to just set PATH=/path/to/conda/env/bin and make sure everything that's needed for the entire workflow is available in there
that way, it's easier to make a single requirements file that describe all the various dependencies for the workflow.
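The single-environment approach above can be sketched in nextflow.config with the env scope, and the per-process alternative with beforeScript (both paths and env names below are hypothetical):

```groovy
// One monolithic conda env on PATH for every task; the escaped
// \$PATH is expanded by the shell at task execution time
env.PATH = "/path/to/conda/env/bin:\$PATH"

// ...or activate a dedicated env per process instead:
process {
    $kaiju {
        beforeScript = 'source activate kaiju_env'
    }
}
```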
Paolo Di Tommaso
@pditommaso
Jan 18 2017 17:55
in my opinion an environment should be defined for each process
yes, though if configuring manually you can likely just define a single environment for the overall pipeline execution
Fredrik Boulund
@boulund
Jan 18 2017 17:56
Using a single monolithic conda env would make installing any dependencies a snap; consider if all the things you want to run are available via either the default conda channel, pip, or e.g. via bioconda or conda-forge channels
Paolo Di Tommaso
@pditommaso
Jan 18 2017 17:57
when managed by NF it should be possible to define a conda env for each process
Fredrik Boulund
@boulund
Jan 18 2017 17:57
yeah, that feels like a natural solution I guess
I suppose you're thinking something similar to the module directive?
Paolo Di Tommaso
@pditommaso
Jan 18 2017 18:02
exactly
Fredrik Boulund
@boulund
Jan 18 2017 18:02
Cool
Félix C. Morency
@fmorency
Jan 18 2017 18:08
Our complete pipeline runs using NF + Singularity! \o/
Paolo Di Tommaso
@pditommaso
Jan 18 2017 18:08
bravo!
Fredrik Boulund
@boulund
Jan 18 2017 18:24
I have another small question, not sure I understand the Groovy syntax properly
Paolo Di Tommaso
@pditommaso
Jan 18 2017 18:25
go
Fredrik Boulund
@boulund
Jan 18 2017 18:25
In NGI's RNAseq pipeline they have these definitions in their config file:
  $makeSTARindex {
    module = ['bioinfo-tools', 'star/2.5.1b']
    cpus = { 10 * task.attempt }
    memory = { 80.GB * task.attempt }
    time = { 5.h * task.attempt }
  }
what does it mean that the definition is prefixed by $ in this case?
Paolo Di Tommaso
@pditommaso
Jan 18 2017 18:26
that's the definition for process named makeSTARindex
Fredrik Boulund
@boulund
Jan 18 2017 18:26
in their main.nf they have a process of the same name defined... Do these definitions merge with those in main.nf then?
Paolo Di Tommaso
@pditommaso
Jan 18 2017 18:27
yes, more exactly the defs in the config file override the process defs in the main script
Fredrik Boulund
@boulund
Jan 18 2017 18:27
ah, nice! It was as I thought then.. but why the dollar sign prefix in the configuration file?
Paolo Di Tommaso
@pditommaso
Jan 18 2017 18:29
the $ is just an escape character to avoid name collision in the config file
for example
process.foo = <val>
and
process.$foo.bar = <val>
in the first case foo is a directive applied to all processes
in the second case bar is a directive applied to the process named foo
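As a concrete sketch in nextflow.config (the process name foo and the values are illustrative):

```groovy
process {
    // applies to every process in the pipeline
    cpus = 2

    // applies only to the process named 'foo'; the $ marks it
    // as a process name rather than a directive
    $foo {
        cpus = 8
        memory = '16 GB'
    }
}
```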
Fredrik Boulund
@boulund
Jan 18 2017 18:32
Hmm... I don't really understand.. What happens if you write process.foo.bar = <val> then?
Paolo Di Tommaso
@pditommaso
Jan 18 2017 18:33
well, nothing
Fredrik Boulund
@boulund
Jan 18 2017 18:33
(maybe that's not even valid Groovy; I'm not very used to groovy)
Paolo Di Tommaso
@pditommaso
Jan 18 2017 18:33
the point was to allow possible future extension
let's say that you have a pipeline defining a process foo
and at some point we want to add in NF a directive named foo
Fredrik Boulund
@boulund
Jan 18 2017 18:35
aha! Now I get it. ok
Paolo Di Tommaso
@pditommaso
Jan 18 2017 18:35
that would break an existing script, thus that's just a convention to identify process names
Fredrik Boulund
@boulund
Jan 18 2017 18:37
speaking of Groovy in general, let's say I want to define a long messy PATH in my config file, consisting of many separate paths. What's the best way?
Is there a way to do like in Python: ":".join(list_of_paths)?
(i.e. taking each path string, concatenating them all, each separated by a colon sign)
Paolo Di Tommaso
@pditommaso
Jan 18 2017 18:38
yes, that's list_of_paths.join(':')
:)
Fredrik Boulund
@boulund
Jan 18 2017 18:38
nice!
Paolo Di Tommaso
@pditommaso
Jan 18 2017 18:39
to play with the code you can use nextflow console
Fredrik Boulund
@boulund
Jan 18 2017 18:39

so a multi-line solution in a configfile would be something like this?

PATH = ['first/path', 
   'second/long/path',
   'third/super/long/path'].join(':')

?

Paolo Di Tommaso
@pditommaso
Jan 18 2017 18:39
yep
Fredrik Boulund
@boulund
Jan 18 2017 18:39
awesome!
Paolo Di Tommaso
@pditommaso
Jan 18 2017 18:39
indeed!
Fredrik Boulund
@boulund
Jan 18 2017 19:57
Hi again,
I don't understand how to work with accepting tuples of files from a channel... Why doesn't this work?
Channel
  .fromFilePairs(params.input_reads, size: -1)
  .ifEmpty{ exit 1, "Cannot find reads: ${params.input_reads}"}
  .into {input_channel}

process test { 
  input:
  set file(reads_1), file(reads_2) from input_channel

  output:
  file "${reads_1.baseName}.out" into output_channel

  """
  echo $reads_1 $reads_2
  """
}
Paolo Di Tommaso
@pditommaso
Jan 18 2017 20:00
because fromFilePairs produces tuples in which the first element is the pair id and the second element the list of files
thus the input should be
  input:
  set pair_id, file(reads) from input_channel
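Putting that together, the earlier process could be rewritten as follows (a sketch; the output naming is illustrative):

```groovy
process test {
  input:
  set pair_id, file(reads) from input_channel

  output:
  file "${pair_id}.out" into output_channel

  // 'reads' is the list of files in the pair, so index into it
  """
  echo ${reads[0]} ${reads[1]} > ${pair_id}.out
  """
}
```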
Fredrik Boulund
@boulund
Jan 18 2017 20:02
aha! so I could echo ${reads[0]} ${reads[1]} in this case?
Paolo Di Tommaso
@pditommaso
Jan 18 2017 20:02
yes
Fredrik Boulund
@boulund
Jan 18 2017 20:02
and then the process would run on each pair of files specified on the command line?
Paolo Di Tommaso
@pditommaso
Jan 18 2017 20:03
yes, pairs need to be specified with a glob file pattern
Fredrik Boulund
@boulund
Jan 18 2017 20:05
ok, rats...
the filenames are a big mess with little structure...
Paolo Di Tommaso
@pditommaso
Jan 18 2017 20:05
um
you can still provide a custom rule to match them
.fromFilePairs(params.input_reads) { file -> your_logic_here }
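The closure receives each matched file and must return the grouping key, so messy names can be handled with custom logic — for example, a sketch assuming filenames like sampleA_R1.fq / sampleA_R2.fq:

```groovy
Channel
    .fromFilePairs(params.input_reads) { file ->
        // strip the _R1/_R2 mate suffix so both files of a pair
        // share the same grouping key (e.g. 'sampleA')
        file.name.replaceAll(/_R[12]\.fq$/, '')
    }
```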
Fredrik Boulund
@boulund
Jan 18 2017 20:08
I feel so constrained not being familiar with Groovy syntax and stdlib, it's tough learning both nextflow and groovy at the same time... :)
I don't really understand the line between groovy stuff and nextflow stuff either, it's somewhat confusing at times :D
Paolo Di Tommaso
@pditommaso
Jan 18 2017 20:09
any java skill ?
Fredrik Boulund
@boulund
Jan 18 2017 20:09
Only from like 10 years ago, mostly forgotten
Paolo Di Tommaso
@pditommaso
Jan 18 2017 20:09
quite enough
Fredrik Boulund
@boulund
Jan 18 2017 20:10
overwritten with Python I guess ;)
Paolo Di Tommaso
@pditommaso
Jan 18 2017 20:10
groovy is a kind of python for the JVM
Fredrik Boulund
@boulund
Jan 18 2017 20:10
Ok, let's start from the top... (btw, I'm amazed by your time investment in my problems! much appreciated)
Paolo Di Tommaso
@pditommaso
Jan 18 2017 20:10
I would suggest having a look at this 6-page primer
Fredrik Boulund
@boulund
Jan 18 2017 20:12
I have paired end read data, and I want my processes to consume pairs of files, so I can give commandlines containing both filenames at once.
what's the easiest way to get there?
Félix C. Morency
@fmorency
Jan 18 2017 20:14
Is it a single run? i.e. nextflow run pipeline.nf --file1 foo --file2 bar
Fredrik Boulund
@boulund
Jan 18 2017 20:15
I'm not sure what's the best way to implement it...
I was imagining having a directory with hundreds of pairs of files that I want to run through the workflow, expecting to be able to run something like nextflow run pipeline.nf --filenames foo/bar*.fq
Is that possible, or even intended to be possible?
Paolo Di Tommaso
@pditommaso
Jan 18 2017 20:15
yes exactly
that on the command line becomes
nextflow run <script> --reads 'blah/blah/foo*.fq'
Fredrik Boulund
@boulund
Jan 18 2017 20:17
that sounds just perfect. I'll study that example and return if I can't figure it out
Paolo Di Tommaso
@pditommaso
Jan 18 2017 20:19
don't forget to wrap the pattern in quote characters on the CLI
Fredrik Boulund
@boulund
Jan 18 2017 20:19
Ah, ok!
Then the globbing happens in nextflow, right?
Paolo Di Tommaso
@pditommaso
Jan 18 2017 20:19
exactly
Fredrik Boulund
@boulund
Jan 18 2017 20:19
instead of the shell serving a space-separated list of files
I think I get it
Paolo Di Tommaso
@pditommaso
Jan 18 2017 20:20
exactly
have a look at this tutorial
Fredrik Boulund
@boulund
Jan 18 2017 20:35
I'm still not getting it to work... Where can I find documentation for how channels and methods on channels, like .fromFilePairs are defined?
Fredrik Boulund
@boulund
Jan 18 2017 20:45
Hmm... There's still something I'm missing, maybe time to call it a day and try again tomorrow
Paolo Di Tommaso
@pditommaso
Jan 18 2017 20:46
:)
Fredrik Boulund
@boulund
Jan 18 2017 20:47

Any reason this wouldn't work?

Channel
  .fromFilePairs(params.input_reads)
  .into {channel1, channel2}

process test {
  input:
  set pair_id, file(reads) from channel1
}

(omitted some things for brevity)

Paolo Di Tommaso
@pditommaso
Jan 18 2017 20:48
should be fine
what's the problem
Fredrik Boulund
@boulund
Jan 18 2017 20:49
how do I access the two different files in the process body?
like ${reads[0]} etc?
Paolo Di Tommaso
@pditommaso
Jan 18 2017 20:49
yes
Fredrik Boulund
@boulund
Jan 18 2017 20:52
Hahaha. Oh I am definitely getting tired... missed that I commented out two lines in my script definition, the two lines producing any actual output
Paolo Di Tommaso
@pditommaso
Jan 18 2017 20:52
better to continue tomorrow ;)
Fredrik Boulund
@boulund
Jan 18 2017 20:53
but no.. that's not it...
Here's the whole thing.. I feel like I'm soo close to finally getting this to work...
Channel
    .fromFilePairs(params.input_reads)
    //.ifEmpty{ exit 1, "Cannot find any reads: ${params.input_reads}"}
    .into {input_reads_kaiju;
           input_reads_kraken;
           input_reads_metaphlan2}


process kaiju {
    tag {pair_id}
    publishDir "${params.outdir}/kaiju", mode: 'copy'
    cpus 1
    memory '8 GB'
    time '2h'

    input:
    set pair_id, file(reads) from input_reads_kaiju

    output:
    file "${pair_id}.kaiju" into kaiju_output

    """
    pwd
    echo $reads $pair_id
    head ${reads[0]} | sort > ${pair_id}.kaiju
    head ${reads[1]} | sort >> ${pair_id}.kaiju
    """
}
Specifying --input_reads 'reads/*' on the command line; the reads directory contains two files.
Paolo Di Tommaso
@pditommaso
Jan 18 2017 20:57
but?
Fredrik Boulund
@boulund
Jan 18 2017 20:57
Running this produces nothing, no output
Paolo Di Tommaso
@pditommaso
Jan 18 2017 20:58
what if u uncomment //.ifEmpty{ exit 1, "Cannot find any reads: ${params.input_reads}"} ?
Fredrik Boulund
@boulund
Jan 18 2017 20:59
ERROR: Cannot find any reads: reads/*
(dammit, had to change computer, my laptop battery died)
gonna take me a minute to set things up again, reconnect ssh and everything
Paolo Di Tommaso
@pditommaso
Jan 18 2017 21:01
the pattern is too short, provide at least the extension, e.g. reads/*.fastq
Fredrik Boulund
@boulund
Jan 18 2017 21:02
still the same error
Paolo Di Tommaso
@pditommaso
Jan 18 2017 21:02
can you list the files on that folder ?
Fredrik Boulund
@boulund
Jan 18 2017 21:02
[fredrikb@milou1 workspace_tmp]$ ls reads/
reads1.fastq  reads2.fastq
Paolo Di Tommaso
@pditommaso
Jan 18 2017 21:03
what if u use reads/*{1,2}.fastq
Fredrik Boulund
@boulund
Jan 18 2017 21:04
then it appears to work!
Paolo Di Tommaso
@pditommaso
Jan 18 2017 21:04
:+1:
Fredrik Boulund
@boulund
Jan 18 2017 21:04
thanks a lot!
Paolo Di Tommaso
@pditommaso
Jan 18 2017 21:04
welcome
Fredrik Boulund
@boulund
Jan 18 2017 21:04
so the reason it didn't work all this time was that I used a too short pattern?
Paolo Di Tommaso
@pditommaso
Jan 18 2017 21:04
I guess so :/
Fredrik Boulund
@boulund
Jan 18 2017 21:05
hmm.. That's annoying... Good to have that out of the way at least!
I have a new question, but let's save that for tomorrow
:) Thanks for all your amazing help today!
Paolo Di Tommaso
@pditommaso
Jan 18 2017 21:08
good, have a good night
Félix C. Morency
@fmorency
Jan 18 2017 21:14
10mins difference (on a 2h+ pipeline) between local singularity image vs shared singularity image (nas)
not bad
Paolo Di Tommaso
@pditommaso
Jan 18 2017 21:15
overall execution time ?
Félix C. Morency
@fmorency
Jan 18 2017 21:15
2h24 (local) vs 2h33 (nas)
Paolo Di Tommaso
@pditommaso
Jan 18 2017 21:16
fair enough