These are chat archives for nextflow-io/nextflow

16th
May 2017
Bili Dong
@qobilidop
May 16 2017 00:50

I’m still having trouble with passing a directory through a channel. Here’s my test code:

#!/usr/bin/env nextflow

data_ch = Channel.fromPath('data')

process ls {
    input:
    file data from data_ch

    output:
    file "${data}/*"

    exec:
    println data.list()
}

Here data is a directory with several files in it. I expect the code to work, but somehow the directory is not linked to the work dir, thus causing the following error:

ERROR ~ Error executing process > 'ls (1)'

Caused by:
  Missing output file(s) `data/*` expected by process `ls (1)`


Source block:
  println data.list()
Bili Dong
@qobilidop
May 16 2017 00:57
My current guess is that the input will be staged in the work dir only when it's a file but not when it’s a dir.
Paolo Di Tommaso
@pditommaso
May 16 2017 06:44
The problem here is that when using exec: input files are not automatically staged
The following snippet works

data_ch = Channel.fromPath('data')

process ls {
    input:
    file data from data_ch

    output:
    file "${data}/*"

    script:
    println data.list()
    '''
    echo true
    '''
}
also if you want to capture the output directory, you should use:
  output:
    file data into something_ch
otherwise it will return the list of files, not the directory itself
Bili Dong
@qobilidop
May 16 2017 06:52
@pditommaso Thank you for the clarification!
chdem
@chdem
May 16 2017 08:21
Hello ! I want to obtain from a channel a string list of filenames with an option '-V' before each.... I'm making tests in the nextflow console (which is a great tool)

this is my first test :

Channel
    .from( 'file1.bam', 'file2.bam', 'file3.bam' )
    .collect {" -V $it" }
    .println()

Of course, the output is not a string but an array

[ -V file1.bam,  -V file2.bam,  -V file3.bam]
Using this in an input by putting it in a val :
process genotype_GVCFs {

    publishDir "${params.outdir}/5-global_GVCF/"

    input:
    set val(gvcfs) from gvcfFiles2.collect { " -V $it" }

    output:
    file {"final.gatk.vcf"} into (gatk_global_gvcf1,gatk_global_gvcf2)

    script:
    """
    java -Xmx${task.memory.toGiga()}g \
    -jar \$GATK_HOME \
    -T GenotypeGVCFs \
    -nt ${task.cpus} \
    -R ${fasta_ref} \
    -L ${params.bed_file} \
    ${gvcfs} \
    -o final.gatk.vcf
    """
}
Evan Floden
@evanfloden
May 16 2017 08:26
You can also use .tokenize(',') to split the list.
chdem
@chdem
May 16 2017 08:26
thank you @skptic ! Let me try this !
Evan Floden
@evanfloden
May 16 2017 08:27
something like list = 'alpha,delta,gamma'.tokenize(',')
This splits the single list item in the channel into different elements.
chdem
@chdem
May 16 2017 08:33
I may have misspoken
Evan Floden
@evanfloden
May 16 2017 08:33
Ahh, sorry. You want to do the opposite!
chdem
@chdem
May 16 2017 08:33
tokenize does exactly the opposite !
yes ! ;)
Evan Floden
@evanfloden
May 16 2017 08:34
I misread, my bad. You can remember it for next time :laughing:
chdem
@chdem
May 16 2017 08:34
:laughing:
as you can see in my process, I'm able to add the -V option
I now need to be able to obtain all the filenames in the same string, just like this : -V file1.bam -V file2.bam -V file3.bam
chdem
@chdem
May 16 2017 08:40
The thing that I don't understand is that I'm able to do that with .collect(). This example with only the filenames is totally working :
process MainFileMergingDoC {
    publishDir "${results_outdir}/TOTAL", mode: 'copy'

    input:
    file doc_files_list from mainDoC.collect()

    output:
    file(prefix_output_filename+"_TOTAL")

    script:
    """
    paste ${doc_files_list} | column -s \$'\t' -tn > ${prefix_output_filename}_TOTAL
    """
}
(doc_files_list contains all my filenames....I just want to add -V before each filename)
Evan Floden
@evanfloden
May 16 2017 08:48
I would have expected the following to work:
Channel
    .from( 'file1.bam', 'file2.bam', 'file3.bam' )
    .collect{" -V $it" } 
    .join(',')
chdem
@chdem
May 16 2017 08:49
It throws an java.lang.StackOverflowError
(in the nextflow console)
(same in my nextflow script)
Evan Floden
@evanfloden
May 16 2017 08:52
Yeah, same here. Paolo will no doubt have an elegant solution.
chdem
@chdem
May 16 2017 08:55
Ah ! I've made a mistake in my input (needlessly using set). The command line of my process is still not what I expect, BUT this is better :
process genotype_GVCFs {
    publishDir "${params.outdir}/5-global_GVCF/"

    input:
    val(gvcfs) from gvcfFiles2.collect { " -V $it" }

    output:
    file {"final.gatk.vcf"} into (gatk_global_gvcf1,gatk_global_gvcf2)

    script:
    """
    java -Xmx${task.memory.toGiga()}g \
    -jar \$GATK_HOME \
    -T GenotypeGVCFs \
    -nt ${task.cpus} \
    -R ${fasta_ref} \
    -L ${params.bed_file} \
    ${gvcfs.join(',')} \
    -o final.gatk.vcf
    """
}
I tried to put the .join(',') in the script section, but I still have a -V file1.bam, -V file2.bam, -V file3.bam result :worried:
Evan Floden
@evanfloden
May 16 2017 08:57
You will just need to make sure your input files are staged into the working directory. My guess is that having just the values (not the files) is not enough.
chdem
@chdem
May 16 2017 08:57
Thank you @skptic , you help me to progress in my code ! :D
nextflow goes and finds the files in the work/XX/XXXXX folder
Evan Floden
@evanfloden
May 16 2017 08:59
Only if they are specified as an input.
chdem
@chdem
May 16 2017 08:59
it should work if I could remove the ','
what do you mean ?
Evan Floden
@evanfloden
May 16 2017 09:02
In your example, you are using val as input. However I think that in the working directory of the process, there will be no input files. So what you need is to construct val and also include the files themselves as an input with file().
chdem
@chdem
May 16 2017 09:04
OK !
I understand ...
Evan Floden
@evanfloden
May 16 2017 09:07
Here is a horrible hack to get your val input
Channel
    .from( 'file1.bam', 'file2.bam', 'file3.bam' )
    .collectFile{ item -> ["text.txt", " -V " + item ]}
    .map {it -> it.text }
    .println()
chdem
@chdem
May 16 2017 09:09
You ROCK !
:D
Evan Floden
@evanfloden
May 16 2017 09:11
No problem, I am sure there must be a more elegant way though.
Ah, I remember now.
Much better:
Channel
    .from( 'file1.bam', 'file2.bam', 'file3.bam' )
    .collect{" -V $it" } 
    .map{ it -> it.join(',')}
    .println()
You have to use join() on the element in the channel (the list) and not the whole channel :wink:
chdem
@chdem
May 16 2017 09:16
ok, I think I understand the concept
but I still have the -V file1.vcf, -V file2.vcf, -V file3.vcf :(
Evan Floden
@evanfloden
May 16 2017 09:18
Sorry typo
chdem
@chdem
May 16 2017 09:18
input:
    val(gvcfs) from gvcfFiles2.collect { " -V $it" }.map{ it2 -> it2.join(',')}
Evan Floden
@evanfloden
May 16 2017 09:18
remove the ,
in join()
chdem
@chdem
May 16 2017 09:19
pfff of course.... I should have seen it !
Thank you @skptic !
This is working ! You make my day ! :D
Evan Floden
@evanfloden
May 16 2017 09:22
My pleasure @chdem. Let us know if you need help with the input files channel.
chdem
@chdem
May 16 2017 09:23
ok, I'll try to deal with this alone and i will come back here if I need help !
Evan Floden
@evanfloden
May 16 2017 09:23
:+1:
Simone Baffelli
@baffelli
May 16 2017 13:05
Hello! First of all, thank you for your excellent job! As I told you already, you are probably saving my PhD
A simple question: how do I expand the result of a channel after "collect" such that I can use the resulting list as a list of arguments for/inside of my scripts?
Like that:
process coherence_decay{
  input:
  val cc from cc_hist.collect()
  val name from name.collect()
  val bl from bl.collect()

  output:
  file(hist_plot) into hist_plot

  """
    coherence_histogram.py --names ${name} --hists ${cc} --bl ${bl}
 """
}
If I use it as posted, the script "coherence_histogram.py" is given the representation of a groovy list including brackets, while I just want to pass a list of parameters to my script
Simone Baffelli
@baffelli
May 16 2017 13:15
Nevermind, I've just solved it.
chdem
@chdem
May 16 2017 13:29
@baffelli Pretty close to my previous problem ! @skptic helped me with that by adding this to my input :
input:
val(gvcfs) from gvcfFiles2.collect { " -V $it" }.map{ it2 -> it2.join('') }
Simone Baffelli
@baffelli
May 16 2017 13:29
That's exactly my solution.
chdem
@chdem
May 16 2017 13:29
Great ! :clap:
Simone Baffelli
@baffelli
May 16 2017 13:29
Only, I defined a separate function because it looks cleaner to me!
chdem
@chdem
May 16 2017 13:30
oh, definitely !
Simone Baffelli
@baffelli
May 16 2017 13:32
the only thing snakemake has that I wish nextflow had is some sort of "interface objects" that allow your functions to take a single parameter, say "inputs" which gives you all the paths and parameters as an attribute.
chdem
@chdem
May 16 2017 14:31
I've only tested snakemake on a few scripts, not enough to be able to make a comparison
Simone Baffelli
@baffelli
May 16 2017 14:33
I've used it quite extensively. It is nice for certain applications, but the pull-based model gets really complicated when you want to do parameter sweeps and other combinatorial analyses.
chdem
@chdem
May 16 2017 14:33
I've a strange error trying to output into multiple channels :
output:
file {"final.gatk.vcf.bz"} into compressed_gvcf1, compressed_gvcf2
it seems possible to do that (according to nextflow-io/nextflow#97)
Simone Baffelli
@baffelli
May 16 2017 14:35
Yes, it is
Evan Floden
@evanfloden
May 16 2017 14:35
@chdem What is the error?
Simone Baffelli
@baffelli
May 16 2017 14:35
but why is the filename in a closure?
I don't think it is necessary in this case
chdem
@chdem
May 16 2017 14:36
ERROR ~ No such variable: process
thanks @baffelli , the closure was the problem !
Evan Floden
@evanfloden
May 16 2017 14:37
Did you try ( instead of {
chdem
@chdem
May 16 2017 14:37
:clap:
@skptic nope, I've simply tried without {
Simone Baffelli
@baffelli
May 16 2017 14:38
I think () can be left out in most cases
Evan Floden
@evanfloden
May 16 2017 14:38
:+1:
chdem
@chdem
May 16 2017 14:38
thanks guys !
Simone Baffelli
@baffelli
May 16 2017 14:39
:smile:
Paolo Di Tommaso
@pditommaso
May 16 2017 15:04
@baffelli interesting, how would a script benefit by using an interface object?
Simone Baffelli
@baffelli
May 16 2017 15:09
I'll try with an example: I am now computing the correlation between all pairs of images that have a certain maximum lag. Afterwards, I extract pixels from a region of interest from each correlation image and I compute a histogram of the correlation value. Finally, I collect all histograms, bin them according to the lag and average them, in order to obtain the empirical PDF of correlation values versus lag. However, the last step is quite inconvenient at the moment, as I collect all lags and histogram files and pass them as a veeeery long list of input files to my empirical pdf script, which loads them and computes the final distribution.
It would be much cleaner if I could just pass an interface object to my python code, without having to bother to parse them using argparse. And I consider myself lucky that python has an excellent commandline parser in the standard library.
/*
* Collect all histograms
* in a single file and plot it
*
*/
process coherence_decay{
  publishDir "$params.results/"

  input:
    val cc from cc_hist.collect().map(make_string)
    val name from name.first()
    val bl from bl.collect().map(make_string)

  output:
    file("coherence_decay_${name}.pdf") into hist_plot

  script:
    """
    coherence_histogram.py --names ${name} --hists ${cc} --bl ${bl} --outfile coherence_decay_${name}.pdf
    """
}
Félix C. Morency
@fmorency
May 16 2017 15:14
sounds like unnecessary coupling to me
Simone Baffelli
@baffelli
May 16 2017 15:14
In what sense
Félix C. Morency
@fmorency
May 16 2017 15:16
and also, the script section can be python
Simone Baffelli
@baffelli
May 16 2017 15:16
I know very well, but that's not my point.
The point is that it would be nicer if you could pass your inputs to your code as if they were a native python object instead of relying on string expansion/parsing of input arguments. Of course this way is very powerful and flexible, but experience taught me that it can be incredibly fragile at times.
chdem
@chdem
May 16 2017 15:19
@baffelli : do you have an example of fragility ?
(after reading the snakemake doc, I think I understand your point)
Simone Baffelli
@baffelli
May 16 2017 15:20
Well, a simple example. I am relying on the fact that groovy's and python's syntax for lists are the same to do the following in my script:
process mask_to_text{

  input:
  file lut from lut
  file inverse_lut from inverse_lut
  set file(dem_seg), file(dem_seg_par) from dem_seg
  set file(ref_mli), file(ref_mli_par) from ref_mli_mask
  each center from params.rois
  val ws from params.ws

  output:
  set file(coord), val(feature_name) into coords

  shell:
  feature_name=center[1]
  log.info "Extracting ROI coordinates for feature ${name} at LV03 coordinates ${center}"
  '''
  #!/usr/bin/env python3
  import pyrat.geo.geofun as gf
  import numpy as np
  import itertools
  import csv
  map = gf.GeocodingTable("!{dem_seg_par}", "!{lut}", "!{ref_mli_par}", "!{inverse_lut}")
  center_radar = map.geo_coord_to_radar_coord(!{center[0]})
  ws = !{ws}
  sl_r = (center_radar[0] -ws, center_radar[0] + ws)
  sl_az = (center_radar[1] -ws, center_radar[1] + ws)
  with open("coord",'w+') as of:
    writer = csv.writer(of, delimiter=' ')
    for cnt, (r, az) in enumerate(itertools.product(sl_r, sl_az)):
      writer.writerow([int(r), int(az), cnt+1])
  '''
}
where !{center[0]} expands to something like [623956.28, 106116.17], which fortunately happens to be the same way to express a list of floats in both groovy and python. Now, suppose that for some unfathomable reason, the string representation of a list in groovy changed to list_623956.28, 106116.17_list. This would expand into a garbage string in my python code and the script would not work anymore. I see that as fragility.
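One way to loosen this coupling, sketched here under the assumption that the Groovy side serializes the coordinates to JSON (for instance with groovy.json.JsonOutput.toJson) and hands them over as a plain string, is to parse the value explicitly on the Python side instead of pasting it into the source. The function name parse_center is hypothetical and not part of the original script.

```python
import json


def parse_center(raw):
    """Parse a JSON-encoded coordinate pair, failing loudly on anything else."""
    # json.loads raises json.JSONDecodeError (a ValueError subclass) on garbage.
    value = json.loads(raw)
    if not isinstance(value, list) or len(value) != 2:
        raise ValueError(f"expected a list of two numbers, got {raw!r}")
    for item in value:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(item, bool) or not isinstance(item, (int, float)):
            raise ValueError(f"non-numeric coordinate in {raw!r}")
    return [float(item) for item in value]


# The string below stands in for what the workflow engine would substitute.
center = parse_center("[623956.28, 106116.17]")
```

With this approach a change in the list representation shows up as an immediate ValueError rather than as a syntax error somewhere in the generated script.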
Félix C. Morency
@fmorency
May 16 2017 15:23
this is not fragility, this is bad design
and unnecessary coupling. as you said, if something somewhere changes, it breaks everything
Simone Baffelli
@baffelli
May 16 2017 15:24
I know, but it removes a lot of boilerplate code. The alternative would be to use argparse for each input/output of the function.

and unnecessary coupling. as you said, if something somewhere changes, it breaks everything

That's why an interface object could help: you would have to change only its implementation, leaving the user's code unchanged

Félix C. Morency
@fmorency
May 16 2017 15:25
yes. argparse allows you to 1) sanitize your inputs, 2) make sure each input is of the right type and 3) allow easier maintenance
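As a rough illustration of those three points, here is a minimal argparse sketch for a command line shaped like the coherence_histogram.py call quoted earlier. The real script is not shown in this chat, so the parser below is hypothetical: the flag names are copied from that call, while the assumption that --bl holds one numeric lag per histogram is mine.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical histogram-plotting command line."""
    parser = argparse.ArgumentParser(description="Hypothetical histogram CLI")
    parser.add_argument("--names", required=True, help="dataset name")
    parser.add_argument("--hists", nargs="+", required=True,
                        help="one or more histogram files")
    # type=float makes argparse reject non-numeric values (assumed semantics).
    parser.add_argument("--bl", nargs="+", type=float, required=True,
                        help="one lag value per histogram file")
    parser.add_argument("--outfile", required=True, help="output plot path")
    return parser


def parse_args(argv=None):
    """Parse argv and check that the two lists line up."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if len(args.hists) != len(args.bl):
        # parser.error prints a usage message and exits with status 2.
        parser.error("--hists and --bl must have the same number of items")
    return args


args = parse_args(["--names", "demo", "--hists", "a.hist", "b.hist",
                   "--bl", "1", "2.5", "--outfile", "out.pdf"])
```

The same script can then be run by hand outside the pipeline with exactly the same arguments.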
Simone Baffelli
@baffelli
May 16 2017 15:26
4) Produce a lot of boilerplate, especially if your function takes multiple inputs/outputs
Félix C. Morency
@fmorency
May 16 2017 15:28
I just don't see the boilerplate argument. What would be the lifespan of the interface object?
Do you have a URL for said snakemake feature?
Simone Baffelli
@baffelli
May 16 2017 15:29
Well, it is an additional level of concern, which is separated from the analysis of your data. It would be available during the invocation of the script.
Félix C. Morency
@fmorency
May 16 2017 15:34
The closest to that in NF are templates and native execution
Chris Fields
@cjfields
May 16 2017 15:35
@pditommaso I sometimes think that a Nextflow 'Best Practices'/'Cookbook' would be great, just from the point of view of answering common 'How do I do X' questions.
Simone Baffelli
@baffelli
May 16 2017 15:36
Well, I'm aware of that. But it is still either string substitution or groovy code. Be aware that I am not asking how to do it, just trying to understand why such a feature is not desirable/needed
Chris Fields
@cjfields
May 16 2017 15:36
@pditommaso of course that means cloning you, which is (currently) not possible
Félix C. Morency
@fmorency
May 16 2017 15:39
imho, it adds unnecessary coupling between language X and Nextflow. One would need to maintain said interface for the various language X versions. Depending on the implementation, this can lead to security issues. In snakemake, can you modify the interface from within the user code?
Having single executables with cmdline interfaces that can sanitize inputs is, to me, more elegant, more KISS and easier to maintain. you can reimplement process Y in any language and if the cmdline interface is the same, it will work without having to change anything
Simone Baffelli
@baffelli
May 16 2017 15:44
It may be, but it depends on where your focus is. In many cases, having to do that for each executable separately can lead to lots of repetition and boilerplate code. Plus, having an interface object would reduce the cognitive burden for inexperienced users. They could just access the data using the same names they declared in the pipeline script, which I see as a huge plus in several cases.
Félix C. Morency
@fmorency
May 16 2017 15:45
shortcuts with a lot of implications :)
here, we also use the scripts outside of NF
Simone Baffelli
@baffelli
May 16 2017 15:48
I may not have understood how it works, but another case that strikes me as odd is when using dynamically named output files: I must define the name in the outputs and I must define it again from within my script/command, making sure that these strings match, in violation of DRY. In that case, it would be nicer if I could just reference it by name in my command and nextflow would expand it as needed.
shortcuts with a lot of implications :)
I agree, that's the most annoying feature of snakemake: you almost cannot debug your scripts standalone. However, there are some cases where an interface object could simplify the code.
Mike Smoot
@mes5k
May 16 2017 16:10
@baffelli one approach I've used with some success that might be an alternative to interface objects is to write JSON and/or YAML files containing "objects" to be passed around (e.g. long file lists). Pretty much any language you choose speaks JSON or YAML.
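A minimal sketch of that idea on the Python side, assuming one task writes a small JSON manifest and a later script reads it back. The file name inputs.json, the key names and the helper functions are all hypothetical, chosen only for illustration.

```python
import json
import os
import tempfile


def write_manifest(path, names, hists, lags):
    """Write the collected inputs to a JSON file that any language can read."""
    manifest = {"names": names, "hists": hists, "lags": lags}
    with open(path, "w") as handle:
        json.dump(manifest, handle, indent=2)


def read_manifest(path):
    """Load the manifest back into plain Python lists and dicts."""
    with open(path) as handle:
        return json.load(handle)


# Round trip in a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    manifest_path = os.path.join(tmp, "inputs.json")
    write_manifest(manifest_path, ["demo"], ["a.hist", "b.hist"], [1.0, 2.5])
    loaded = read_manifest(manifest_path)
```

The consuming script then takes a single file argument instead of a very long list of paths and values.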
Simone Baffelli
@baffelli
May 16 2017 16:11
That's an excellent idea indeed!
Paolo Di Tommaso
@pditommaso
May 16 2017 16:12
@cjfields I definitely agree on both things :)
Félix C. Morency
@fmorency
May 16 2017 17:19
@mes5k +1
@baffelli +1 for the output string matching thing. That might be something to improve in nf
Bili Dong
@qobilidop
May 16 2017 17:31
@baffelli @fmorency I might have missed the context of the discussion. For the name matching thing, a workaround I’ve been using is to define a variable in the script section and refer to it in both script and output. For example
#!/usr/bin/env nextflow

num = Channel.from( 1, 2, 3 )

process example {
    input:
    val x from num

    output:
    file "${y}" into outputs

    script:
    y = "${x}.out"
    """
    echo ${x} > ${y}
    """
}
Félix C. Morency
@fmorency
May 16 2017 17:32
cute
although still dangerous in the @baffelli context
Simone Baffelli
@baffelli
May 16 2017 19:20
@qobilidop I thought of exactly the same solution while running :smile: