These are chat archives for nextflow-io/nextflow

16th May 2018
Sven F.
@sven1103
May 16 2018 11:13
@pditommaso squashed the commits into one.
CI is running now
:)
Pierre Lindenbaum
@lindenb
May 16 2018 12:27
Hi all, Q: is there a way to fix a process 'by hand'? I've got ~10k targets. About 10 targets crashed because there wasn't enough JVM memory for GATK. Is there a way to fix those targets 'by hand' (modifying .command.sh and running with more memory...) and then tell nextflow to resume the pipeline as if the exit status was '0'?
Maxime Garcia
@MaxUlysse
May 16 2018 12:33
Hi @lindenb you should be able to rerun the failed process from within the work directory
Not sure if it'll work to resume the pipeline
Pierre Lindenbaum
@lindenb
May 16 2018 12:36
@MaxUlysse so if I change .command.sh and run bash .command.run => exit(0), will it resume the state of the pipeline?
Maxime Garcia
@MaxUlysse
May 16 2018 12:36
Not sure about that
But I think I'd try that, probably on a test case first to be sure
I'm guessing you don't want to restart everything for just these 10 targets
For another time, you can also make a retry with more memory as an errorStrategy directive https://www.nextflow.io/docs/latest/process.html#errorstrategy
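The retry-with-more-memory pattern from that docs page can be sketched like this (a minimal sketch; the process name and command are made up, and the memory values are illustrative):

```groovy
process gatkCall {
    // on failure, retry up to 3 times, asking for more memory each attempt
    errorStrategy 'retry'
    maxRetries 3
    memory { 4.GB * task.attempt }

    script:
    """
    your_gatk_command_here
    """
}
```

The closure form `{ 4.GB * task.attempt }` is re-evaluated on each attempt, which is what makes the escalation work.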
Pierre Lindenbaum
@lindenb
May 16 2018 12:39

I'm guessing you don't want to restart everything for just these 10 targets

yesss

yes, for now I've got 10 retries, but I'm afraid the allocated memory still wasn't enough anyway.
Luca Cozzuto
@lucacozzuto
May 16 2018 12:40
if you use -resume and increase the memory in the config, you'll re-run only the failed ones
ehm, but you need to change the command line as well... don't you?
Pierre Lindenbaum
@lindenb
May 16 2018 12:42
@lucacozzuto tell me if I'm wrong: wouldn't changing the memory in my process affect all 10k targets, so they'd all be re-run?
Luca Cozzuto
@lucacozzuto
May 16 2018 12:43
if you don't touch the command line, no... but let me run a test first :)
@lindenb I just tried increasing the memory requirements and it is cached.
the only problem I see is that with GATK you need to touch the command line as well
and in that case I think the cache is invalidated
Pierre Lindenbaum
@lindenb
May 16 2018 12:48
@lucacozzuto anyway, to increase the JVM memory I'll have to change the command line (java -Xmx5g ...), unless I can use something like ${memory} in the script.

I think the cache is invalidated

that's what I want: I want to fool the cache ! :smile:
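For what it's worth, the memory directive is exposed inside the script as task.memory, so the -Xmx flag can track the retry without hard-coding a value (a sketch, assuming a GATK3-style java invocation like Pierre's; the process name and jar path are made up):

```groovy
process gatkCall {
    errorStrategy 'retry'
    maxRetries 3
    memory { 5.GB * task.attempt }

    script:
    // task.memory reflects the current attempt's allocation
    """
    java -Xmx${task.memory.toGiga()}g -jar GenomeAnalysisTK.jar ...
    """
}
```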

Luca Cozzuto
@lucacozzuto
May 16 2018 12:50
:) If I were you, I'd move the problematic samples into a second folder and re-run the analysis only on them
I bet you'll also find other problems with them :)
Paolo Di Tommaso
@pditommaso
May 16 2018 12:53
everything under control?
Pierre Lindenbaum
@lindenb
May 16 2018 12:55
@pditommaso are you talking to me? What do you mean?
Paolo Di Tommaso
@pditommaso
May 16 2018 12:56
just wanted to know if you have solved your problem
Pierre Lindenbaum
@lindenb
May 16 2018 12:58
@pditommaso no, I'm waiting for the last retry to fail and then I will try something.
The main problem is: can I fool the cache by running a few targets 'by hand' without nextflow, and then resume nextflow?
Paolo Di Tommaso
@pditommaso
May 16 2018 12:58
yep, drop the work dir of the tasks you want to re-execute
ahh, wait
in principle yes
change into the work dir, fix it, then run bash .command.run
that should do the trick
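Paolo's recipe, spelled out as a command sketch (the work-dir hash here is hypothetical; look up the real one for the failed task in .nextflow.log or with nextflow log):

```
# 1. go to the failed task's work directory (hypothetical hash shown)
cd work/3f/9a1b2c...
# 2. edit .command.sh by hand, e.g. raise -Xmx
# 3. re-run the task wrapper; on success it rewrites .exitcode with 0
bash .command.run
# 4. back at the project root, resume the pipeline
nextflow run main.nf -resume
```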
Pierre Lindenbaum
@lindenb
May 16 2018 13:01
cool , I'll try and tell you.
Maxime Garcia
@MaxUlysse
May 16 2018 13:34
That wasn't a bad idea after all then ;-)
Tobias Neumann
@t-neumann
May 16 2018 15:30

hi. I'm trying to run a process specifically with a certain docker container (gatk4).

Following the manual I created the config as described:

process {

    publishDir = [path: './results', mode: 'copy', overwrite: 'true']

    errorStrategy = 'retry'
    maxRetries = 3
    maxForks = 20

    cpus = 1
    time = { 1.h * task.attempt }
    memory = { 1.GB * task.attempt }

    withName:gatk {
        container = 'docker://broadinstitute/gatk:4.0.4.0'
    }
}

timeline {
    enabled = true
}

singularity {
    enabled = true
}

my process then does a Mutect2 call:

process gatk {

    tag { parameters.name }

    input:
    val(parameters) from samples

    output:
    file('*.vcf') into outGatk

    shell:
    '''

    gatk Mutect2 \
        -R !{params.ref} \
        -I !{parameters.tumor} \
        -I !{parameters.normal} \
        -tumor !{parameters.name}T \
        -normal !{parameters.name}N \
        -O !{parameters.name}.vcf

    '''
}

When I run it now, it crashes with

Command error:
  .command.sh: line 4: gatk: command not found

Now when I explicitly list the container in the process


process gatk {

    tag { parameters.name }

    container = 'docker://broadinstitute/gatk:4.0.4.0'

.....

it runs fine. What am I doing wrong?

Paolo Di Tommaso
@pditommaso
May 16 2018 15:31
does docker run broadinstitute/gatk:4.0.4.0 gatk work?
Tobias Neumann
@t-neumann
May 16 2018 15:32
when listed in the process or just the command in a shell?
Paolo Di Tommaso
@pditommaso
May 16 2018 15:32
in the shell
Tobias Neumann
@t-neumann
May 16 2018 15:33
I'm running it via singularity, but yes the image that gets pulled and created via nextflow works
Paolo Di Tommaso
@pditommaso
May 16 2018 15:35
I suspect that when converting to the singularity format it breaks the path
try to run singularity exec docker://broadinstitute/gatk:4.0.4.0 gatk
does it work ?
Tobias Neumann
@t-neumann
May 16 2018 15:41
yes
singularity exec docker://broadinstitute/gatk:4.0.4.0 gatk
Docker image path: index.docker.io/broadinstitute/gatk:4.0.4.0
Cache folder set to /users/tobias.neumann/.singularity/docker
Creating container runtime...
tar: google-cloud-sdk/.wh..wh..opq: implausibly old time stamp 1970-01-01 01:00:00

 Usage template for all tools (uses --spark-runner LOCAL when used with a Spark tool)
    gatk AnyTool toolArgs

 Usage template for Spark tools (will NOT work on non-Spark tools)
    gatk SparkTool toolArgs  [ -- --spark-runner <LOCAL | SPARK | GCS> sparkArgs ]

 Getting help
    gatk --list       Print the list of available tools

    gatk Tool --help  Print help on a particular tool

 Configuration File Specification
     --gatk-config-file                PATH/TO/GATK/PROPERTIES/FILE

 gatk forwards commands to GATK and adds some sugar for submitting spark jobs

   --spark-runner <target>    controls how spark tools are run
     valid targets are:
     LOCAL:      run using the in-memory spark runner
     SPARK:      run using spark-submit on an existing cluster
                 --spark-master must be specified
                 --spark-submit-command may be specified to control the Spark submit command
                 arguments to spark-submit may optionally be specified after --
     GCS:        run using Google cloud dataproc
                 commands after the -- will be passed to dataproc
                 --cluster <your-cluster> must be specified after the --
                 spark properties and some common spark-submit parameters will be translated
                 to dataproc equivalents

   --dry-run      may be specified to output the generated command line without running it
   --java-options 'OPTION1[ OPTION2=Y ... ]'   optional - pass the given string of options to the
                 java JVM at runtime.
                 Java options MUST be passed inside a single string with space-separated values.
Paolo Di Tommaso
@pditommaso
May 16 2018 15:44
weird, it should work if so
check the .command.run for that task: is the singularity command line there?
Stephen Kelly
@stevekm
May 16 2018 16:37
is there a way to download the Nextflow documentation for offline usage?
Paolo Di Tommaso
@pditommaso
May 16 2018 16:38
cd docs && make html
?
Stephen Kelly
@stevekm
May 16 2018 16:39
thanks
Paolo Di Tommaso
@pditommaso
May 16 2018 16:39
welcome
Stephen Kelly
@stevekm
May 16 2018 16:42
ah you need sphinx to build the docs though
Paolo Di Tommaso
@pditommaso
May 16 2018 16:59
yep
python ;)
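Put together, the offline-docs recipe looks roughly like this (assumes pip and make are available; the output directory is the standard Sphinx default and may differ):

```
pip install sphinx
git clone https://github.com/nextflow-io/nextflow
cd nextflow/docs && make html
# then open _build/html/index.html in a browser
```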