These are chat archives for nextflow-io/nextflow

11th
Feb 2019
Tim Dudgeon
@tdudgeon
Feb 11 11:38
Hi, does anyone have any experiences to share with remote submission of NF jobs? I want a generic and robust way to submit a workflow on a remote server that can use a number of executors (including, but not limited to, SGE and Ignite). The process needs to be able to stage the files (e.g. using scp), fire off the workflow (e.g. using ssh), monitor the job for completion, and copy the results back, doing all of this in a fairly generic manner that handles things like lost network connections.
Harshil Patel
@drpatelh
Feb 11 14:16
@pditommaso I'm wondering whether it makes sense to create a value channel from this block of code instead of forking it into other named channels. The channel should contain a single value, which would be the path to a fasta file. It would make things much tidier, but I'm not sure whether it's the best way to do things or whether it's even possible...
Paolo Di Tommaso
@pditommaso
Feb 11 14:17
definitely yes
Harshil Patel
@drpatelh
Feb 11 14:17
I tested it but I'm getting
Paolo Di Tommaso
@pditommaso
Feb 11 14:18
just use fasta_index = file(params.fasta)
Harshil Patel
@drpatelh
Feb 11 14:18
ERROR ~ Channel `fasta` has been used twice as an input by process `makeGenomeFilter` and process `makeBWAindex`
Paolo Di Tommaso
@pditommaso
Feb 11 14:18
is params.fasta a string path ?
Harshil Patel
@drpatelh
Feb 11 14:19
Ahh ok. That's the discrepancy. So it won't work with fromPath.
It should be, yes. Either a file path or a web link (e.g. Amazon iGenomes).
Paolo Di Tommaso
@pditommaso
Feb 11 14:19
fromPath is only needed to resolve wildcards
Harshil Patel
@drpatelh
Feb 11 14:21
You learn something new every day! I was clearly making the wrong assumptions.
Paolo Di Tommaso
@pditommaso
Feb 11 14:21
LOL
there's too much boilerplate in the nf-core pipelines
looking forward to meeting you in April to improve them
Harshil Patel
@drpatelh
Feb 11 14:24
:thumbsup:
This is why we changed from file() to filePath()
Ping @ewels @apeltzer
Paolo Di Tommaso
@pditommaso
Feb 11 14:26
whenever you have Channel.fromPath(path) and path is not a glob, you can replace it with file(path)
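To make the point above concrete, here is a minimal sketch of the fix being discussed. The process names come from the error message quoted earlier; the script bodies and param name are hypothetical. file() returns a single value that can be reused as an input by any number of processes, whereas a queue channel created with Channel.fromPath can only be consumed once.

```nextflow
// file() works for a local path or a URL and yields a reusable value,
// so no channel forking (or duplicate-use error) occurs.
fasta_file = file(params.fasta)

process makeGenomeFilter {
    input:
    file fasta from fasta_file

    script:
    // hypothetical command for illustration
    """
    samtools faidx $fasta
    """
}

process makeBWAindex {
    input:
    file fasta from fasta_file    // same value, used a second time without error

    script:
    // hypothetical command for illustration
    """
    bwa index $fasta
    """
}
```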
Alexander Peltzer
@apeltzer
Feb 11 14:45
Thanks for that ping !!!
:-D
Yup, looking forward to that for sure!
Paolo Di Tommaso
@pditommaso
Feb 11 14:46
we - may - work on the first nf-core modularized pipeline ;)
Harshil Patel
@drpatelh
Feb 11 14:46
Wowzas!
Paolo Di Tommaso
@pditommaso
Feb 11 14:47
:joy:
Félix C. Morency
@fmorency
Feb 11 14:47
:confetti_ball:
Maxime Garcia
@MaxUlysse
Feb 11 14:48
\o/
I do want to see that
Can we join on the fun?
Alexander Peltzer
@apeltzer
Feb 11 14:51
Either by coming to Tuebingen, or I can set up a remote connection / video
Network there should be quite nice, so I guess we can arrange that :+1:
Harshil Patel
@drpatelh
Feb 11 14:52
@apeltzer Was the issue above an edge-case that will require the use of fromPath for things to run on different FS?
Maxime Garcia
@MaxUlysse
Feb 11 14:52
I'm afraid it'll be complicated to join for me this time, but, I'll be there remotely
Alexander Peltzer
@apeltzer
Feb 11 14:52
@drpatelh We had some issues with a markdown template that wasn't staged without these adjustments
(the file was on the same cluster, but the /home and the /beegfs folders are on different systems)
(the latter is the storage we're running analyses in, but when running nextflow run nf-core/bla it pulls the pipeline to ~/.nextflow/assets/nf-core/bla and thus symlinks across two different FS)
Paolo Di Tommaso
@pditommaso
Feb 11 14:59
too excited to share this, are you ready for NF script 2.0 ?
require 'modules.nf', params:[gatk: params.gatk, results: params.results]

log.info """\
C A L L I N G S  -  N F    v 2.0 
================================
genome   : $params.genome
reads    : $params.reads
variants : $params.variants
blacklist: $params.blacklist
results  : $params.results
gatk     : $params.gatk
"""

genome_file     = file(params.genome)
variants_file   = file(params.variants)
blacklist_file  = file(params.blacklist)
reads_ch        = Channel.fromFilePairs(params.reads)

PREPARE_GENOME_SAMTOOLS(genome_file)

PREPARE_GENOME_PICARD(genome_file)

PREPARE_STAR_GENOME_INDEX(genome_file)

PREPARE_VCF_FILE(variants_file, blacklist_file)

RNASEQ_MAPPING_STAR( 
      genome_file, 
      PREPARE_STAR_GENOME_INDEX.output, 
      reads_ch)

RNASEQ_GATK_SPLITNCIGAR(
      genome_file, 
      PREPARE_GENOME_SAMTOOLS.output, 
      PREPARE_GENOME_PICARD.output, 
      RNASEQ_MAPPING_STAR.output)

RNASEQ_GATK_RECALIBRATE(
            genome_file, PREPARE_GENOME_SAMTOOLS.output, 
            PREPARE_GENOME_PICARD.output, 
            RNASEQ_GATK_SPLITNCIGAR.output, 
            PREPARE_VCF_FILE.output)
    . into { final_output_ch; bam_for_ASE_ch }


RNASEQ_CALL_VARIANTS( 
        genome_file, 
        PREPARE_GENOME_SAMTOOLS.output, 
        PREPARE_GENOME_PICARD.output, 
        final_output_ch.groupTuple())

POST_PROCESS_VCF( 
          RNASEQ_CALL_VARIANTS.output, 
          PREPARE_VCF_FILE.output )

PREPARE_VCF_FOR_ASE( POST_PROCESS_VCF.output )

ASE_KNOWNSNPS(
      genome_file, 
      PREPARE_GENOME_SAMTOOLS.output, 
      PREPARE_GENOME_PICARD.output, 
      group_per_sample(bam_for_ASE_ch, PREPARE_VCF_FOR_ASE.output[0]) )
instead of this
Tobias "Tobi" Schraink
@tobsecret
Feb 11 15:00
That's awesome!
Alexander Peltzer
@apeltzer
Feb 11 15:01
wow
just wow
Luca Cozzuto
@lucacozzuto
Feb 11 15:01
!!!!!!!!! WONDERFUL !!!!!!!
Paolo Di Tommaso
@pditommaso
Feb 11 15:01
still need to cleanup some stuff
Harshil Patel
@drpatelh
Feb 11 15:02
Bloody gorgeous
Paolo Di Tommaso
@pditommaso
Feb 11 15:04
I think I'm introducing the first and only NF code style rule: process names should go in uppercase
Tobias "Tobi" Schraink
@tobsecret
Feb 11 15:06
How would one best collect output of a channel into a map? I have a tree of different samples where I need to compare each child node with its parent.
relationships = ['1_1':'1',  '1_2':'1',  '1_2_1':'1_2']
samples = Channel.from(['1', file('1')], ['1_1', file('1_1')], ['1_2', file('1_2')], ['1_2_1', file('1_2_1')])
//Intermediate I thought should be a map: ['1': file('1'), '1_1': file('1_1'), '1_2': file('1_2'), '1_2_1': file('1_2_1')]
//Desired final output is [['1', file('1')], ['1_1', file('1_1')]], [['1', file('1')], ['1_2', file('1_2')]],  [['1_2', file('1_2')], ['1_2_1', file('1_2_1')]]
Luca Cozzuto
@lucacozzuto
Feb 11 15:08
@pditommaso mmm it'll look like our pipelines are shouting... :) DO MAPPING! DO VARIANT CALLING!
Paolo Di Tommaso
@pditommaso
Feb 11 15:09
@tobsecret all the channel content? reduce
Paolo Di Tommaso
@pditommaso
Feb 11 15:12
:joy:
Maxime Garcia
@MaxUlysse
Feb 11 15:17
OMG!
That's wonderful, where can I sign
Harshil Patel
@drpatelh
Feb 11 15:19
All payments should be made to the @drpatelh GitHub account within the next 24 hours. If your credit card fails you will be asked to use another card.
Paolo Di Tommaso
@pditommaso
Feb 11 15:19
ahah
Tobias "Tobi" Schraink
@tobsecret
Feb 11 15:35
@pditommaso but what's the specific syntax?
samples.collect().reduce( { name, file -> [name:file] })
Paolo Di Tommaso
@pditommaso
Feb 11 15:39
samples.reduce( [:] ) { map, item -> /* put in the map */ }
Tobias "Tobi" Schraink
@tobsecret
Feb 11 15:43
So I would have to create an empty map?
emptymap = [:]
samples.reduce( [:] ) { map, item -> emptymap}
Paolo Di Tommaso
@pditommaso
Feb 11 15:44
the empty map is passed already in my example
Tobias "Tobi" Schraink
@tobsecret
Feb 11 15:50
I am confused :sweat_smile:
so where you put in the map, you would just write something like
println(samples.reduce([:]) { map, it -> map[it[0]]=it[1]})
Paolo Di Tommaso
@pditommaso
Feb 11 15:54
Channel.from('hello world'.toList())
      .reduce([:]) { map, ch -> 
        if(map[ch])map[ch]++
        else map[ch]=1
        return map
       }
       .println()
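Applying that reduce pattern back to the samples/relationships example from earlier: the sketch below is an assumption about the intended usage, reusing the variable names from the question. It accumulates a [name: file] map, then emits one [parent, child] pair per relationship.

```nextflow
relationships = ['1_1': '1', '1_2': '1', '1_2_1': '1_2']

Channel.from(['1', file('1')], ['1_1', file('1_1')],
             ['1_2', file('1_2')], ['1_2_1', file('1_2_1')])
    .reduce([:]) { map, item ->       // accumulate [name: file] entries
        map[item[0]] = item[1]
        return map                    // the returned map seeds the next iteration
    }
    .flatMap { map ->                 // emit one [parent, child] pair per relationship
        relationships.collect { child, parent ->
            [[parent, map[parent]], [child, map[child]]]
        }
    }
    .println()
```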
Alexey Dushen
@blacky0x0
Feb 11 15:59
  1. Which IDE or editor is preferable to use while editing *.nf files? Intellij Idea / Visual Studio / Atom / nextflow.ui.console.Nextflow? It's a little bit hard for me to write new pipelines without autocomplete tools.
  2. Will NF support a rigid graph structure instead of dynamic graph evaluation? For instance, there is an example for Argoproj -> https://github.com/argoproj/argo/blob/master/examples/dag-diamond-steps.yaml
  3. Are there any tutorials, courses, or books for studying the syntax of NF and running real-life examples? Something like the cloud classes at https://www.katacoda.com/courses/kubernetes
  4. In which areas is NF used? I've found mentions about NF in Chemistry and Bio-tech. Any others?
    Thanks in advance for any answer
Maxime Garcia
@MaxUlysse
Feb 11 16:03
@blacky0x0 You can look at https://github.com/nextflow-io/awesome-nextflow#tutorials for some tutorials
I would also recommend the NF tutorial from nf-core: https://nf-co.re/nextflow_tutorial
for the editor, I personally like Atom, but then again, you can get at least syntax highlighting in VS and vim as well: https://github.com/nextflow-io/awesome-nextflow#syntax-highlithing
No idea about your second question
Maxime Garcia
@MaxUlysse
Feb 11 16:09
and for the 4th, I'm guessing mainly bioinformatics, as you can see from the list
Tobias "Tobi" Schraink
@tobsecret
Feb 11 16:13
@pditommaso aaaaaaah, okidoki, that makes a ton of sense! Thanks a bunch
Paolo Di Tommaso
@pditommaso
Feb 11 16:13
;)
Félix C. Morency
@fmorency
Feb 11 16:17
I think I'm introducing the first and only NF code style rule: process names should go in uppercase

:'(
Paolo Di Tommaso
@pditommaso
Feb 11 16:17
ahah .. why ? :D
Félix C. Morency
@fmorency
Feb 11 16:18
Reminds me of the old CMake coding style back in the days. They changed it afterward. Don't make the same mistake ;)
Paolo Di Tommaso
@pditommaso
Feb 11 16:19
well, having module as function calls, my idea was to use uppercase to highlight you are invoking a process instead of plain function
Tobias "Tobi" Schraink
@tobsecret
Feb 11 16:20
Any sort of convention/ style guide is appreciated - my scripts are all over the place
Phil Ewels
@ewels
Feb 11 16:28
Convention is fine as long as it's not enforced :wink:
Paolo Di Tommaso
@pditommaso
Feb 11 16:28
of course
Phil Ewels
@ewels
Feb 11 16:28
I agree that it helps with readability in your example script above
(which looks awesome by the way :+1:)
Paolo Di Tommaso
@pditommaso
Feb 11 16:29
happy you like it!
Luca Cozzuto
@lucacozzuto
Feb 11 16:31
Hi all! A quick question: is there any way to transform a channel like this [-V hello, -V ciao, -V bonjour] into a string?
Paolo Di Tommaso
@pditommaso
Feb 11 16:32
that looks more a list .. :unamused:
Luca Cozzuto
@lucacozzuto
Feb 11 16:32
I mean into a channel with only one element, like this: "-V hello, -V ciao, -V bonjour"
Paolo Di Tommaso
@pditommaso
Feb 11 16:32
code
Luca Cozzuto
@lucacozzuto
Feb 11 16:32
you are right... it's the result of a collect from a channel
Channel
    .from( 'hello', 'ciao', 'bonjour' )
    .map { "-V ${it}" }
    .set {channelb}

process printmyres {
input:    
set cmdline_pars from channelb.collect()
script:
    """
    print ${channelb}
    """
}
Paolo Di Tommaso
@pditommaso
Feb 11 16:33
maybe you need a new keyboard :D
Luca Cozzuto
@lucacozzuto
Feb 11 16:33
I need to sleep more :)
Paolo Di Tommaso
@pditommaso
Feb 11 16:34
list = [ 'hello', 'ciao', 'bonjour' ]

process printmyres {
script:
    """
    print ${list.collect { "-V $it" }.join(' ')}
    """
}
Luca Cozzuto
@lucacozzuto
Feb 11 16:35
wow thanks a lot
Anthony Ferrari
@af8
Feb 11 16:38
previous_process {
  output: set key, file(out) into out_ch
} 

out_ch = out_ch.groupTuple(by: 0, size: intervals_count, remainder: false)
    .map { id, chunks -> [
        id,
        Channel.fromPath(chunks)
            .collectFile(name: id + '-merged-file', keepHeader: true, skip: 1).getVal().toAbsolutePath()
    ] }

next_process {
  input: set id, file(merge) from out_ch
}
I am trying to use collectFile() on the fly to feed a new process, but I am getting a java.util.ConcurrentModificationException. Is there something bad I am doing with the collectFile() operator? Thanks
Paolo Di Tommaso
@pditommaso
Feb 11 16:39
let's start from scratch: what are you trying to do ?
Anthony Ferrari
@af8
Feb 11 16:44
I am parallelizing a process (I split the genome into several intervals), each time producing a chunk file together with an analysis key. I first use groupTuple to gather the chunks relevant to the same analysis key. After that I want to pass the merged file along with the analysis key to the next process. For this purpose I use the collectFile() operator as shown above. But when next_process is called I get the java.util.ConcurrentModificationException.
This means (from what I've googled) that a modification to a list was attempted while iterating over it.
Paolo Di Tommaso
@pditommaso
Feb 11 16:46
operators cannot be nested inside each other
Anthony Ferrari
@af8
Feb 11 16:48
You mean the map and the collectFile in my example ?
Paolo Di Tommaso
@pditommaso
Feb 11 16:48
this snippet is wrong
       Channel.fromPath(chunks)
            .collectFile(name: id + '-merged-file', keepHeader: true, skip: 1).getVal().toAbsolutePath()
Anthony Ferrari
@af8
Feb 11 16:52
OK so I guess I would rather create an intermediary process to merge the file instead of using collectFile. Unless you know a smart way to do it with collectFile. Need to improve my Groovy :-)
Paolo Di Tommaso
@pditommaso
Feb 11 16:54
you should fork the channel, apply collectFile on the copy and recombine it; maybe an intermediate process is the simpler option
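A rough sketch of the intermediate-process alternative suggested above, which keeps collectFile out of the picture entirely. The process name and merge command are hypothetical; out_ch and intervals_count come from the earlier snippet, and the awk line reproduces the keepHeader/skip behaviour by keeping the header from the first chunk only.

```nextflow
process mergeChunks {
    input:
    set key, file(chunks) from out_ch.groupTuple(by: 0, size: intervals_count, remainder: false)

    output:
    set key, file('merged.txt') into merged_ch

    script:
    """
    # FNR==1 matches the first line of each file; NR!=1 excludes the very
    # first line overall, so only the first chunk's header is kept
    awk 'FNR==1 && NR!=1 { next } { print }' $chunks > merged.txt
    """
}
```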
Anthony Ferrari
@af8
Feb 11 16:57
@pditommaso Yeah, makes sense. I think an intermediary process would also be more readable for others. Too much playing with channels just for that is not necessary. Thanks
Tobias "Tobi" Schraink
@tobsecret
Feb 11 17:15
For future reference, @pditommaso, I created a gist:
https://gist.github.com/tobsecret/80426d7cf9bddfce4a10a1b479ff7f0a
Caspar
@caspargross
Feb 11 17:57
How do you deal with symlinked input files on another server when running the pipeline with a docker container? From my understanding docker does not follow external symlinks. Is there a way to go around this? In local execution mode everything is fine. When running with docker I get a file not found error.
hydriniumh2
@hydriniumh2
Feb 11 21:25
If I'm using aws batch with an s3 bucket as the 'work' folder when nextflow uploads and downloads files for each process is it checking the file integrity of what I'm downloading? I'm trying to run joint-genotyping on a lot of samples and I'm concerned about pernicious data corruption from repeated downloads and uploads to s3