These are chat archives for nextflow-io/nextflow

1st
Mar 2018
amacbride
@amacbride
Mar 01 2018 00:00
Howdy @pditommaso -- I'm trying to modularize some of my template scripts, but wasn't sure how to specify the location of my common library of BASH functions in my NF scripts. So if I have the following NF script:
process meta {
        echo true

        script:
                template "library.sh"
}
With this template:
#!/bin/bash

source common.sh

test_echo
How would it locate the following common.sh?
# bash
# library of common functions for Nextflow template scripts

test_echo () {
        echo "foo"
        date
}
amacbride
@amacbride
Mar 01 2018 00:18
I tried putting common.sh in the templates directory, and tried the source directive with both common.sh and templates/common.sh, but neither worked.
Any pointers?
amacbride
@amacbride
Mar 01 2018 00:45
This seemed to work, but I wasn't sure if it was the right way to do it:
source $workflow.projectDir/templates/common.sh
Daniel E Cook
@danielecook
Mar 01 2018 04:38
Has anyone taken a crack at syntax highlighting in Sublime Text?
The atom spec looks like it could be ported over relatively easily
This is a start:
%YAML 1.2
---
# See http://www.sublimetext.com/docs/3/syntax.html
file_extensions:
  - nf

scope: source.groovy

contexts:
  # The prototype context is prepended to all contexts but those setting
  # meta_include_prototype: false.

  main:
    # The main context is the initial starting point of our syntax.
    # Include other contexts from here (or specify them directly).
    - include: process-def
    - include: scope:source.groovy

  process-def:
    # .sublime-syntax has no begin/end; use match with captures and push/pop
    - match: ^\s*(process)\s+(\w+|"[^"]+"|\'[^\']+\')\s*
      captures:
        1: support.keyword.nextflow
        2: support.function.nextflow
      push: process-body

  process-body:
    - meta_scope: support.process
    - match: \}
      pop: true
    - match: (?:afterScript|beforeScript|cache|container|cpus|clusterOptions|disk|echo|errorStrategy|executor|ext|maxErrors|maxForks|maxRetries|memory|module|penv|publishDir|queue|scratch|storeDir|stageInMode|stageOutMode|tag|time|validExitStatus)\b
      scope: support.process.directive.type.nextflow markup.bold
    - match: '(?:input|output|script|shell|exec):'
      scope: support.constant.block.nextflow markup.bold
Paolo Di Tommaso
@pditommaso
Mar 01 2018 06:40
@danielecook that's nice, but you also need the groovy part of the grammar
Daniel E Cook
@danielecook
Mar 01 2018 06:41
Yes - I saw you had a custom file for that...
Daniel E Cook
@danielecook
Mar 01 2018 06:41
I’ll try to plug it in
Paolo Di Tommaso
@pditommaso
Mar 01 2018 06:41
if you are willing to setup and test a sublime text package I'm happy to assist you
@amacbride use baseDir instead, e.g. source $baseDir/templates/common.sh
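putting that together, a minimal sketch of the working layout (file names taken from the question above):

main.nf:

process meta {
    echo true

    script:
    template "library.sh"
}

templates/library.sh:

#!/bin/bash

# baseDir below is resolved by Nextflow when the template is rendered
source $baseDir/templates/common.sh

test_echo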
Daniel E Cook
@danielecook
Mar 01 2018 06:54
Thanks I’ve got groovy included…some lingering issues
Paolo Di Tommaso
@pditommaso
Mar 01 2018 06:55
what sort of?
Daniel E Cook
@danielecook
Mar 01 2018 06:55
issues with comments
sorry…triple quoted blocks with funky chars
Paolo Di Tommaso
@pditommaso
Mar 01 2018 06:55
for example?
Daniel E Cook
@danielecook
Mar 01 2018 06:56
(screenshot: Screen Shot 2018-03-01 at 12.56.11 AM.png)
pardon the mess...
this code block breaks the highlighting for all subsequent processes
It works fine in Atom btw
so something is not matching up
Paolo Di Tommaso
@pditommaso
Mar 01 2018 06:56
well, that's not a mess, it's bioinformatics ! :)
Daniel E Cook
@danielecook
Mar 01 2018 06:57
haha thanks
I am missing the codeblock line you have - so looking at how to add that
I believe it will clear it up
Paolo Di Tommaso
@pditommaso
Mar 01 2018 06:57
does it work if you use the groovy language in sublime text ?
Daniel E Cook
@danielecook
Mar 01 2018 06:57
no - same issue
Paolo Di Tommaso
@pditommaso
Mar 01 2018 06:58
ah, if so it's a problem in the groovy part
quite complicated
Daniel E Cook
@danielecook
Mar 01 2018 06:59
I tried to find a translator between Atom and Sublime… the systems for defining syntax seem very similar
I thought the code block section you have for Atom might have overridden the groovy issues
Paolo Di Tommaso
@pditommaso
Mar 01 2018 07:00
the structure seems the same, only the format changes
Daniel E Cook
@danielecook
Mar 01 2018 07:00
There are a couple of differences
but they are subtle
Paolo Di Tommaso
@pditommaso
Mar 01 2018 07:00
try to use the one for VScode, it's json based
it should be easier to convert automatically to yaml
also the sublime grammar for groovy looks shorter than the one in VScode/atom, I guess it's less accurate
Daniel E Cook
@danielecook
Mar 01 2018 07:11
I think sublime is expecting xml not json
Paolo Di Tommaso
@pditommaso
Mar 01 2018 07:17
no wait, this is yaml
so why xml ?
Daniel E Cook
@danielecook
Mar 01 2018 07:23
I think if you are using a .tmLanguage extension, sublime expects xml. If you use .sublime-syntax the format is YAML
The console error in sublime is error parsing lexer: Packages/nextflow.tmLanguage: Error parsing plist xml: expected < in file Packages/nextflow.tmLanguage on line: 1
Paolo Di Tommaso
@pditommaso
Mar 01 2018 07:23
well, extension can be easily changed :)
Daniel E Cook
@danielecook
Mar 01 2018 07:24
same result with .json
Paolo Di Tommaso
@pditommaso
Mar 01 2018 07:24
I see
Daniel E Cook
@danielecook
Mar 01 2018 07:32
Thanks for taking a look. I will try to return to this later.
Paolo Di Tommaso
@pditommaso
Mar 01 2018 07:32
:+1:
you are welcome
Bioninbo
@Bioninbo
Mar 01 2018 08:27

Hello. I get an error:

Command error:
  .command.stub: line 38: /dev/fd/62: No such file or directory
  ps: bad -o argument 'state', supported arguments: user,group,comm,args,pid,ppid,pgid,tty,vsz,stat,rss
  [E::sam_parse1] SEQ and QUAL are of different length
  [W::sam_read1] Parse error at line 63723
  [main_samview] truncated file.

On line 38, there is a call to the function walk() (within nxf_tree()). Do you think error 1 (/dev/fd/62 missing) and error 2 (truncated file) are linked? And how can I fix error 1? I tried ln -s /proc/self/fd /dev/fd but it did not help.

Paolo Di Tommaso
@pditommaso
Mar 01 2018 08:28
are you using a container ?
Bioninbo
@Bioninbo
Mar 01 2018 08:32
yes
Paolo Di Tommaso
@pditommaso
Mar 01 2018 08:33
is it a biocontainer ?
Bioninbo
@Bioninbo
Mar 01 2018 08:33
yes I got it from the galaxy depot
Paolo Di Tommaso
@pditommaso
Mar 01 2018 08:34
if so it's this, nextflow-io/nextflow#499
they use an old version of ps
you need to upgrade it
but in any case I don't think it breaks your task
it just means it can't collect the mem and cpu usage metrics
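for an image you control, the usual fix is something like this (a sketch; base image name illustrative, and it assumes a Debian-based image so apt-get is available):

FROM biocontainers/biocontainers:latest
USER root
RUN apt-get update && apt-get install -y procps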
Bioninbo
@Bioninbo
Mar 01 2018 08:35
Ah I see
That's why this is blank in my reports
Thanks for that!
Paolo Di Tommaso
@pditommaso
Mar 01 2018 08:36
:+1:
Tintest
@Tintest
Mar 01 2018 13:20

Hi there, is it possible to specify several containers for a single process that uses multiple tools? Until now it was one process, one tool / container. For example, this is how I'm doing it in my config file:

singularity.enabled = true
singularity.runOptions = "--bind $PWD:$PWD --bind ${params.resultDir}:${params.resultDir} --bind ${params.sequenceDir}:${params.sequenceDir} --bind ${params.runDir}:${params.runDir}"

process {
    $bcl2fastq {
        singularity.runOptions = "${singularity.runOptions} "
        container = "${params.singularityDir}/bcl2fastq-v2.20.0.simg"
    }
}

Thank you :)

Paolo Di Tommaso
@pditommaso
Mar 01 2018 13:22
yes, but if you set that in the config file you need to use an explicit closure, e.g.
...process = {"$params.singularityDir/bcl2fastq-v2.20.0.simg" }
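expanded into a full config block, the idea is (a sketch based on the snippet above; container is the setting that takes the closure):

process {
    $bcl2fastq {
        container = { "$params.singularityDir/bcl2fastq-v2.20.0.simg" }
    }
}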
Tintest
@Tintest
Mar 01 2018 13:23
Errrr ... Can you make an example with 2 container names?
Tintest
@Tintest
Mar 01 2018 13:28
I think there is a misunderstanding in my question. I don't want to use a container with two tools in it, I want to use two separate containers containing one tool each for the same process. Or maybe I'm the one who doesn't understand the answer btw :p
Paolo Di Tommaso
@pditommaso
Mar 01 2018 13:30
in the same process execution or a different container for a different run of the same process ?
Tintest
@Tintest
Mar 01 2018 13:30
in the same process execution, e.g. running a bwa container and a samtools container in the same process
Alexander Peltzer
@apeltzer
Mar 01 2018 13:31
IIRC, it is not possible to have two different containers for a single process
Paolo Di Tommaso
@pditommaso
Mar 01 2018 13:31
you need to split it into separate processes if you want to use different container images
Tintest
@Tintest
Mar 01 2018 13:31
Ok, I just wanted to be sure
I will build a single container for both tools then, sorry
Thanks alot ! :)
Paolo Di Tommaso
@pditommaso
Mar 01 2018 13:32
much better ;)
Tintest
@Tintest
Mar 01 2018 13:32
Do you recommend one container per process, or one giant container with all the tools needed for my pipeline?
Paolo Di Tommaso
@pditommaso
Mar 01 2018 13:33
I find it easier to maintain one fat container
Tintest
@Tintest
Mar 01 2018 13:34
Mmmmmh I see, but if I want to change only one component I just have to pull one small singularity image. I'm still a newbie with containers though, so I don't think I would build "clean" containers; that's why I prefer using pre-built ones :D
Like Biocontainers
Paolo Di Tommaso
@pditommaso
Mar 01 2018 13:36
you asked for my opinion, I gave it to you :)
Tintest
@Tintest
Mar 01 2018 13:37
Okok thanks :)
Paolo Di Tommaso
@pditommaso
Mar 01 2018 13:37
recently I've been building containers like this
FROM continuumio/miniconda
RUN conda config --add channels defaults \
 && conda config --add channels conda-forge \
 && conda config --add channels bioconda \
 && conda install -y picard=2.9 bwa=0.7.15 fastqc=0.11.5 sambamba=0.6.6
Alexander Peltzer
@apeltzer
Mar 01 2018 13:37
I have to admit that this is currently the easiest way
Tintest
@Tintest
Mar 01 2018 13:37
pretty easy indeed :)
Alexander Peltzer
@apeltzer
Mar 01 2018 13:38
• you can pipe between individual tools, and you can run a software-versions collection script in a single process at the end, e.g. for MultiQC ;-)
Tintest
@Tintest
Mar 01 2018 13:39
I see sambamba there, do you think sambamba can really replace samtools for the most basic tasks? When I first discovered sambamba there were some people arguing the pileup results were discordant. But I mainly use sambamba for flagstat, view and sort ...
Paolo Di Tommaso
@pditommaso
Mar 01 2018 13:40
@emi80 is our sambamba guru :)
Tintest
@Tintest
Mar 01 2018 13:42
Thanks for your Docker cheatsheet :) I'm going to build one fat container :D
Paolo Di Tommaso
@pditommaso
Mar 01 2018 13:43
:+1:
you are welcome
rfenouil
@rfenouil
Mar 01 2018 14:37
Hello all,
in a script block, I am trying to create a file with groovy commands before accessing it with bash commands. However it seems that the bash process cannot see this file (it does not seem to be created in the work folder).
Is this something I should avoid doing?
Paolo Di Tommaso
@pditommaso
Mar 01 2018 14:38
provide an example please
rfenouil
@rfenouil
Mar 01 2018 14:39
yes sorry, it looks like that:
    script:
    // Create a file containing samples names
    def sampleNamesFile  = new File('sampleNames.txt')
    sampleNamesFile.text = sampleNames.join("\n")
    // Extract information for multiQC config and sample renaming files
    """
    title=\$(parseSampleNames.R -i $sampleNamesFile -c title)
    echo \$title > example.txt
    """
$sampleNamesFile is correctly replaced by the 'sampleNames.txt' name but it does not seem to exist in the work folder at all
Paolo Di Tommaso
@pditommaso
Mar 01 2018 14:42
you cannot do that
use the heredoc syntax to create a file in the bash script
I think even a simple echo would work
"""
echo "${sampleNames.join("\n")}" > sampleNames.txt
title=\$(parseSampleNames.R -i $sampleNamesFile -c title)
echo \$title > example.txt
"""
rfenouil
@rfenouil
Mar 01 2018 14:45
Ok thank you, that's what my initial script was doing, but I wasn't comfortable generating such a big shell command when having tons of samples. Isn't there a size limit for that?
Paolo Di Tommaso
@pditommaso
Mar 01 2018 14:48
there's no strict upper limit, but up to a few MBs it should be fine; above that, in particular if you have many jobs, it could smell
rfenouil
@rfenouil
Mar 01 2018 14:53
Ok that should be enough for my needs, thank you very much !
Vladimir Kiselev
@wikiselev
Mar 01 2018 15:08
Hi Paolo! My process for each of its inputs produces multiple equivalent files, each of which is an input to the next process. Can I set up nextflow in a way that these files are piped into the second process, so that I don’t need to save them to disk?
Paolo Di Tommaso
@pditommaso
Mar 01 2018 15:10
if you want to stream via unix pipes, you need to do that in the same process
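as a sketch, streaming inside a single process looks like this (tool command lines illustrative):

process align {
    input:
    file reads from reads_ch

    output:
    file 'out.bam' into bam_ch

    script:
    """
    bwa mem ref.fa $reads | samtools sort -o out.bam -
    """
}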
Vladimir Kiselev
@wikiselev
Mar 01 2018 15:11
no, maybe I am not clear enough, will try to explain
Paolo Di Tommaso
@pditommaso
Mar 01 2018 15:11
I think I get it
if you don't want to save to a file, you need to stream, i.e. pipe the output, right?
Vladimir Kiselev
@wikiselev
Mar 01 2018 15:13

yes, so when I do e.g. this

output:
    file "*.fastq" into fastq_fastqc

it means that in this process the resulting fastq files were saved to disk, right?

Paolo Di Tommaso
@pditommaso
Mar 01 2018 15:14
that means: take all files with the .fastq extension and send them over the fastq_fastqc channel (which will be used by another process)
so it just passes the file paths
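a two-process sketch of that hand-off (process names and the touch/echo commands are placeholders):

process make_fastq {
    output:
    file "*.fastq" into fastq_fastqc

    script:
    """
    touch sample.fastq
    """
}

process fastqc {
    input:
    file fq from fastq_fastqc

    script:
    """
    echo "running fastqc on $fq"
    """
}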
Vladimir Kiselev
@wikiselev
Mar 01 2018 15:20
sorry, I think I figured it out, need to spend more time running NF!
Paolo Di Tommaso
@pditommaso
Mar 01 2018 15:21
what's the problem with that?
Vladimir Kiselev
@wikiselev
Mar 01 2018 15:21
so, all the files created in the process will be saved within the work directory unless I use publishDir, right?
which I can delete afterwards
Paolo Di Tommaso
@pditommaso
Mar 01 2018 15:22
all the files created in the process will always be saved within the work dir
independently of the use of publishDir
the only thing that affects this behaviour is the process.scratch setting
in that case all process files are created in a scratch path on the local storage, then only the ones declared as output are copied to the work dir
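a one-line config sketch of that setting:

process.scratch = true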
Vladimir Kiselev
@wikiselev
Mar 01 2018 15:25
Yep, I see, many thanks. scratch looks very useful. Sorry for taking your time with these silly questions, don’t have enough experience yet… And trying to optimise as much as possible.
Paolo Di Tommaso
@pditommaso
Mar 01 2018 15:25
you are welcome
Phil Ewels
@ewels
Mar 01 2018 15:41
@wikiselev - be a little careful with the settings of publishDir. It's been a while, but I think the default is to softlink there, so if you then rm -rf work you'll lose everything. We specifically copy the files out - means data duplication but makes cleaning up & data delivery easier.
Vladimir Kiselev
@wikiselev
Mar 01 2018 15:42
great, thanks, it’s very good to know!
Paolo Di Tommaso
@pditommaso
Mar 01 2018 15:46
whenever possible, the hardlink publish option is the best way to go
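a directive sketch showing the modes discussed (output and path illustrative):

process example {
    // 'copy' duplicates the file, 'link' makes a hard link; the default is 'symlink'
    publishDir "${params.outdir}", mode: 'copy'

    output:
    file 'result.txt'

    script:
    """
    echo done > result.txt
    """
}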
Karin Lagesen
@karinlag
Mar 01 2018 15:48
I am a bit confused.
Paolo Di Tommaso
@pditommaso
Mar 01 2018 15:49
I'm a lot :)
Karin Lagesen
@karinlag
Mar 01 2018 15:49
wildcards in input and output channels, do I always have to have a period after them?
:D
Paolo Di Tommaso
@pditommaso
Mar 01 2018 15:49
period after? show an example
Karin Lagesen
@karinlag
Mar 01 2018 15:49
ok, so I have this as my input channel
set pair_id, file("${pair_id}*_concat.fq.gz") from reads.view()
and these are the files I get
Angen-bacDNA2-78-2013-01-4718_S291_concat.fq.gz Angen-bacDNA2-78-2013-01-4718_S292_concat.fq.gz
what I wanted was
Angen-bacDNA2-78-2013-01-4718_S29_R1_concat.fq.gz Angen-bacDNA2-78-2013-01-4718_S29_R2_concat.fq.gz
I then read the wildcard section in the manual, and there I noticed that there was always a period after the wildcard character, thus my question
Paolo Di Tommaso
@pditommaso
Mar 01 2018 15:51
(just a sec)
Paolo Di Tommaso
@pditommaso
Mar 01 2018 16:03
sorry I was interrupted
Bioninbo
@Bioninbo
Mar 01 2018 16:03
Hello everyone. I have a question concerning Nextflow and R. Do you think the tool is appropriate for (exploratory) analysis in R? For instance, having to load the same libraries or R objects multiple times in different processes can introduce significant delays. Or maybe there is a cleverer way to avoid this redundancy?
Paolo Di Tommaso
@pditommaso
Mar 01 2018 16:04
@karinlag you should not use * in the input declaration
Karin Lagesen
@karinlag
Mar 01 2018 16:04
ok, so I can't use wildcards for input, only for output?
Paolo Di Tommaso
@pditommaso
Mar 01 2018 16:05
and you don't need a period after the wildcard either
Karin Lagesen
@karinlag
Mar 01 2018 16:06
I'm confused in that case, because I distinctly get the sense from the manual that wildcards can be used for inputs
Paolo Di Tommaso
@pditommaso
Mar 01 2018 16:06
the wildcard in the input file name means: stage the input file with the provided name, replacing the wildcard with the i-th index of the input
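a sketch of that staging behaviour (channel name as in the snippet above):

process example {
    input:
    set pair_id, file('concat_*.fq.gz') from reads

    script:
    """
    # the incoming files are staged as concat_1.fq.gz, concat_2.fq.gz, ...
    ls concat_*.fq.gz
    """
}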
Karin Lagesen
@karinlag
Mar 01 2018 16:06
ah, ok
In that case, this section is a bit unclear
Paolo Di Tommaso
@pditommaso
Mar 01 2018 16:07
improvements are welcome ! :)
Karin Lagesen
@karinlag
Mar 01 2018 16:07
@pditommaso and thanks for the clarification!
Paolo Di Tommaso
@pditommaso
Mar 01 2018 16:08
you are welcome
(going offline)
Mike Smoot
@mes5k
Mar 01 2018 17:55

Hi @pditommaso (whenever you get back), I'm still debugging my slurm nightmare and I'm seeing this line right before my pipeline stops:

Mar-01 00:24:43.674 [Task monitor] DEBUG n.processor.TaskPollingMonitor - <<< barrier arrives (monitor: slurm)

I understand the barrier stuff for processes (i.e. when a process has finished running for all values in the channel), but not how it relates to the executor. Can you provide any insight?

Paolo Di Tommaso
@pditommaso
Mar 01 2018 18:01
it signals there aren't any more tasks to compute in the NF queue
Mike Smoot
@mes5k
Mar 01 2018 18:19
Ok, that makes sense. Since this is a failure, I guess session.isTerminated() or session.isAborted() must be returning true.
Paolo Di Tommaso
@pditommaso
Mar 01 2018 18:19
yep
this is always the sbatch submit error, right?
Mike Smoot
@mes5k
Mar 01 2018 18:23
Yup, I'm just running into the maxErrors limit. It seems as though when machines come/go from the cluster slurm freaks out for a couple of seconds and I get bursts of sbatch failures. I thought I'd increased maxErrors for this run, but apparently not.
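The sort of thing I meant to set, for the record (a sketch, values illustrative):

process {
    errorStrategy = 'retry'
    maxRetries = 5
    maxErrors = 50
}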
AWS Batch, here I come! :)
Paolo Di Tommaso
@pditommaso
Mar 01 2018 18:24
:)
I would also give k8s a try
it's a promising stack
Mike Smoot
@mes5k
Mar 01 2018 18:26
Good idea! That would be a good excuse to spend some time learning about kubernetes.
Paolo Di Tommaso
@pditommaso
Mar 01 2018 18:27
the interesting part of k8s is the storage abstraction
Mike Smoot
@mes5k
Mar 01 2018 18:30
Maybe once I have more than a picosecond of spare time I'll dig in. :)
Paolo Di Tommaso
@pditommaso
Mar 01 2018 18:30
ahah
in particular
Mike Smoot
@mes5k
Mar 01 2018 18:34
Do you have any books or websites you'd recommend for learning about k8s?
besides kubernetes.io?
Paolo Di Tommaso
@pditommaso
Mar 01 2018 18:35
the official docs are a mess
one very popular is Kubernetes Up & Running
it's a very good start
Mike Smoot
@mes5k
Mar 01 2018 18:37
excellent, thanks for the pointers!
Paolo Di Tommaso
@pditommaso
Mar 01 2018 18:37
:+1:
Shawn Rynearson
@srynobio
Mar 01 2018 22:51

I have a question regarding aws-batch and docker runs.

If I want to add an ephemeral drive to an EC2 instance, docker doesn't seem to self-discover the drive as it does with EBS. However, reviewing the nextflow docker section, it seems that I can add a temp directory. Will this carry over to aws-batch?

Example:

process {
    $fastqc {
        cpus = 2
        memory = '5GB'
        queue = 'run-queue'
        docker.temp = '/media/ephemeral0'
        container = '726197484957.dkr.ecr.us-west-2.amazonaws.com/qc'
    }
}
Also I noticed that the command.run script will copy any data from S3 to the EC2 ./ directory. Would the above docker.temp allow enough space for ~>200GB files?