These are chat archives for nextflow-io/nextflow

15th May 2019
Cedric
@Puumanamana
May 15 01:55
@stevekm Thanks, I didn't know you could also have the beforeScript inside the configuration file
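For example, a minimal sketch of that in nextflow.config (the module name here is just a placeholder):

// nextflow.config -- this beforeScript applies to every process
process {
    beforeScript = 'module load samtools/1.9'   // hypothetical module
}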
Alaa Badredine
@AlaaBadredine_twitter
May 15 06:59
@micans thanks for the info!
Paolo Di Tommaso
@pditommaso
May 15 07:42
@stevekm input env yes, global env no. In any case, test it.
Ólavur Mortensen
@olavurmortensen
May 15 08:44
Say I have several samples that need to be processed by several processes, like the one I've written below. Because channels are a type of FIFO queue, the workflow I've written doesn't make sure that the correct samples are processed together. How do I make sure this works correctly?
vcfs_ch = Channel.fromPath(vcf_path + "/*.vcf")
bams_ch = Channel.fromPath(bam_path + "/*.bam")
process A {
    input:
    file vcf from vcfs_ch
    file bam from bams_ch

    script:
    """
    [something]
    """
}
Alaa Badredine
@AlaaBadredine_twitter
May 15 08:46
@olavurmortensen you may want to try something like join https://www.nextflow.io/docs/latest/operator.html#join
or fromFilePairs https://www.nextflow.io/docs/latest/channel.html#fromfilepairs by matching the sample names.
Ólavur Mortensen
@olavurmortensen
May 15 08:49
I see, I could use (sample ID, file) tuples, and join files by sample ID.
That would give me (sample ID, VCF, BAM) tuples.
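A minimal sketch of that idea, assuming the VCF and BAM for a sample share the same base file name (the glob paths are from the example above):

// Turn each file into a (sampleId, file) tuple, then join on sampleId
vcfs_ch = Channel.fromPath(vcf_path + "/*.vcf").map { f -> [f.baseName, f] }
bams_ch = Channel.fromPath(bam_path + "/*.bam").map { f -> [f.baseName, f] }
paired_ch = vcfs_ch.join(bams_ch)   // emits (sampleId, vcf, bam)

process A {
    input:
    set val(sample_id), file(vcf), file(bam) from paired_ch

    script:
    """
    [something]
    """
}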
Alaa Badredine
@AlaaBadredine_twitter
May 15 08:50
If I'm not mistaken, yeah, it should work
Ólavur Mortensen
@olavurmortensen
May 15 08:50
Another problem arises if input files are processed by processes A and B, and then the outputs from A and B are processed by a process C. I think joining by sample ID should solve that as well.
Alaa Badredine
@AlaaBadredine_twitter
May 15 08:56
hmmm if process A and B are two independent processes, then I believe you should join them by sample IDs
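The same pattern extends to the A/B/C case, assuming A and B each emit (sampleId, file) tuples (the channel names here are made up):

// Combine the per-sample outputs of A and B before feeding C
c_input_ch = a_out_ch.join(b_out_ch)   // emits (sampleId, fileFromA, fileFromB)

process C {
    input:
    set val(sample_id), file(a_file), file(b_file) from c_input_ch

    script:
    """
    [something]
    """
}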
micans
@micans
May 15 09:13
@AlaaBadredine_twitter I don't think changing the default strict bash settings in NF is a good idea. Then you cannot distinguish true errors from cases like this (grep). When you use the idiom grep || true, your intentions become explicit in the code, and you let NF catch all other errors; I think it is the clearest and most robust solution.
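For example, a sketch of that idiom inside a process (the pattern, channel, and file names are made up):

process filterLines {
    input:
    file(txt) from lines_ch

    output:
    file('hits.txt') into hits_ch

    script:
    """
    # grep exits 1 when nothing matches; '|| true' turns that into success
    grep 'some_pattern' ${txt} > hits.txt || true
    """
}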
Alaa Badredine
@AlaaBadredine_twitter
May 15 09:16
@micans I agree with you, I will stick with that for the time being! Thanks for the feedback
micans
@micans
May 15 09:16
Cool! :+1:
Paolo Di Tommaso
@pditommaso
May 15 09:18
you can also just do it at the task level
(set +e; grep whatever .. )
micans
@micans
May 15 09:25
To go down this route even further ... you may want to catch other errors (e.g. file does not exist). The man page says this:
       Normally, the exit status is 0 if selected lines are found and 1 otherwise. But the exit status is 2 if an error occurred, unless the -q or --quiet or --silent option is used and a selected line is found. Note, however, that POSIX only mandates, for programs such as grep, cmp, and diff, that the exit status in case of error be greater than 1; it is therefore advisable, for the sake of portability, to use logic that tests for this general condition instead of strict equality with 2.
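Following that advice, a sketch of a script block that tolerates "no match" (exit status 1) but still fails the task on a real error (status greater than 1); 'some_pattern' and the ${txt} input are carried over from the sketch above:

script:
"""
status=0
grep 'some_pattern' ${txt} > hits.txt || status=\$?
if [ "\$status" -gt 1 ]; then
    # a genuine grep error (e.g. unreadable file): fail the task
    exit "\$status"
fi
"""

(The \$ escapes keep the bash variables out of Groovy's string interpolation.)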
Alaa Badredine
@AlaaBadredine_twitter
May 15 09:26
that's interesting! I didn't know about that. I guess I can try some modifications in the pipeline now :D
Chelsea Sawyer
@csawye01
May 15 12:52
For the cleanup = true feature set in a config file: if I were running the same pipeline at the same time for a few different runs, would this delete only the work directory files of that specific run, or the whole work directory shared by all the runs? If my question is unclear, please let me know.
Mazen Mahdi
@Shamanga_13_twitter
May 15 12:59
Hello all!
Let's say I have a process that spawns a Docker container for each value in a channel. I would like this process to be parallelized, meaning each container runs on its own CPU. I used the local executor with 2 CPUs (assume there are 2 values in the channel), but the processes ran sequentially. I also tried the slurm executor; both jobs were submitted at the same time, but for some reason one of them was pending due to resource allocation even though resources are available. Any ideas?
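For reference, the setup described might look like the sketch below (container image and names are placeholders). One thing worth checking: on the local executor, tasks only run side by side while the sum of their cpus requests fits within the executor's CPUs, so each task should request 1 CPU here rather than 2.

process runPerValue {
    executor 'local'
    cpus 1                      // each task claims one CPU, so two can run in parallel
    container 'ubuntu:18.04'    // hypothetical image

    input:
    val x from Channel.from(1, 2)

    script:
    """
    echo "processing ${x}"
    """
}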
Riccardo Giannico
@giannicorik_twitter
May 15 13:05

Hi, if I have a channel of grouped files, how can I use both the key and the files in a process?

ch_files= Channel.from('fileA_1.txt','fileA_2.txt','fileB_1.txt','fileB_2.txt')
ch_groupedfiles= ch_files.groupBy{ file -> file.name.split(/file/)[1].split(/_/)[0] }
// ch_groupedfiles is  [A:[fileA_1.txt,fileA_2.txt], B:[fileB_1.txt,fileB_2.txt]]
process mytest {
  // I need this to run for each key
  input: 
  set val(mykey), file(files) from ch_groupedfiles
  """
  echo "${mykey} related to ${files}" 
  """
}

the set val(mykey) line in my example is not working properly

najitaleb
@najitaleb
May 15 13:10
what does it mean when a process produces the necessary output file, but still gives this error: Missing output file(s) Jie.enriched.rds expected by process randomNum
I am using docker to run the pipeline if that helps
Chelsea Sawyer
@csawye01
May 15 13:48
Hi @giannicorik_twitter, this works for me:
ch_files= Channel.from('fileA_1.txt','fileA_2.txt','fileB_1.txt','fileB_2.txt')
ch_groupedfiles= ch_files.map{ file -> [file.split(/file/)[1].split(/_/)[0], file] }.groupTuple()

process mytest {
  // I need this to run for each key
  echo true
  input:
  set val(mykey), file(files) from ch_groupedfiles

  script:
  """
  echo ${mykey} related to ${files}
  """
}
Riccardo Giannico
@giannicorik_twitter
May 15 13:59
@csawye01 Yes, thanks, it was .groupTuple() that I was missing :)
Paolo Di Tommaso
@pditommaso
May 15 14:32
you were half an NF developer then :joy:
Eugene Bragin
@eugene.bragin_gitlab
May 15 14:38
Hi, I have a default publishDir setting. Is there any way to make a particular process not publish at all? (Its outputs are temporary and only used by subsequent processes.)
micans
@micans
May 15 15:35
@eugene.bragin_gitlab I wonder if setting publishDir null for that process does the trick.
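A sketch of that suggestion in nextflow.config (untested, as per the message above, and the process name is made up):

process {
    publishDir = 'results'
    withName: makeTempFiles {
        publishDir = null   // intent: suppress publishing for this process only
    }
}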
Riccardo Giannico
@giannicorik_twitter
May 15 15:43

you were half an NF developer then :joy:

Yeah, now I can finally be a real boy ... :P (Pinocchio reference)

Venkat Malladi
@vsmalladi
May 15 16:13
Hi all. I have a question regarding licensing of Nextflow pipelines
how are people handling it
and what license are people generally going with
Paolo Di Tommaso
@pditommaso
May 15 16:18
NF is Apache 2.0, therefore you can do whatever you want as long as you mention it
then it depends on your target user
Venkat Malladi
@vsmalladi
May 15 16:33
I was thinking MIT, but maybe I should do GPL, with the licenses of the programs the workflow uses overriding the head license
Any good examples that people have?
Just going back and forth with my university on what the license should be
Paolo Di Tommaso
@pditommaso
May 15 16:50
My fav OSS license is MPL
and it makes sense for pipeline scripts
Eugene Bragin
@eugene.bragin_gitlab
May 15 17:07
@pditommaso thanks
@micans thanks, will try
Paolo Di Tommaso
@pditommaso
May 15 17:07
welcome
Venkat Malladi
@vsmalladi
May 15 17:57
Thanks will do that
evanbiederstedt
@evanbiederstedt
May 15 22:55
I have a quick question about error strategies:

Let's say we have a standard WGS pipeline, alignment to germline SNV calling using GATK best practices.

Let's also say we try running 500+ samples at once, via a single Nextflow run.

What we would like is a pipeline that runs all steps to completion. However, there are sometimes problems with sample quality; e.g. assume there are 5 bad samples: some will fail at alignment, others at variant calling.

By default, it appears the pipeline stops at the first error. This makes debugging tricky in the above situation.

If we use ignore, the pipeline finishes; however, this is suboptimal for debugging, as the pipeline has continued past errors (e.g. consider the case where there are more than 5 bad samples).

If we use maxRetries, that will handle failures caused by transient glitches, but not failures caused by bad inputs like those above.

Is there an error strategy you recommend for running NF from start to finish, where the bad samples simply "stop" when a problem arises?
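One pattern along these lines (a sketch, not something confirmed in this chat): retry each per-sample task a couple of times to absorb transient glitches, then ignore it so the rest of the run completes; the failed samples can be collected afterwards from the trace report or log. Process and channel names here are hypothetical.

process alignReads {
    // per-sample task: retry twice, then drop the sample and carry on
    errorStrategy { task.attempt <= 2 ? 'retry' : 'ignore' }
    maxRetries 2

    input:
    set val(sample_id), file(reads) from reads_ch

    script:
    """
    [alignment command]
    """
}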