These are chat archives for nextflow-io/nextflow

11th Aug 2016
Tiffany Delhomme
@tdelhomme
Aug 11 2016 09:03
Hi Paolo, I have the same problem as last Friday: my process is re-submitted, but with an exitcode=0 and no error in the .command.err...
Tiffany Delhomme
@tdelhomme
Aug 11 2016 09:25
I cannot find the cause of this. I have a large .command.log with around 10,000 lines of output, do you think this could be the reason?
Paolo Di Tommaso
@pditommaso
Aug 11 2016 09:44
do you have an accounting tool in your cluster?
@tdelhomme I mean, a tool that reports the cluster runtime info for that job?
Tiffany Delhomme
@tdelhomme
Aug 11 2016 11:34

this?

Summary of time in seconds spent in various states:
JOBID   USER    JOB_NAME  PEND    PSUSP   RUN     USUSP   SSUSP   UNKWN   TOTAL
189683  delhomm *vcf_(1)  318     0       68575   0       0       0       68893

that means 19 hours running

this is also in the .command.log, so maybe you were not talking about this
we also have the precise times in the log
Started at Wed Aug 10 15:10:26 2016
Results reported at Thu Aug 11 10:13:21 2016
Tiffany Delhomme
@tdelhomme
Aug 11 2016 11:57
here is the complete info from the cluster:
Job <189683>, Job Name <nf-annotate_vcf_(1)>, User <delhommet>, Project <defaul
                     t>, Command <#!/bin/bash;#BSUB -o /mnt/beegfs/delhommet/ne
                     edlestack_callings/annotate_GATK_vcf/myeloma/work/1e/58592
                     e36e990c0f0e9a9d9ce845ecb/.command.log;#BSUB -M 8192;#BSUB
                      -R "rusage[mem=8192]";#BSUB -J nf-annotate_vcf_(1);#BSUB 
                     -m "cn01 cn02 cn03 cn04 cn05 cn06 cn07 cn08 cn09 cn10 cn11
                      cn12"; # NEXTFLOW TASK: annotate_vcf (1);set -e;set -u;NX
                     F_DEBUG=${NXF_DEBUG:=0}; [[ $NXF_DEBUG > 2 ]] && set -x; n
                     xf_env() {;    echo '============= task environment ======
                     =======';    env | sort | sed "s/\(.*\)AWS\(.*\)=\(.\{6\}\
                     ).*/\1AWS\2=\3xxxxxxxxxxxxx/";    echo '============= task
                      output ==================';}; nxf_kill() {;    declare -a
                      ALL_CHILD;    while read P PP;do;        ALL_CHILD[$PP]+=
                     " $P";    done < <(ps -e -o pid= -o ppid=);     walk() {; 
                            [[ $1 != $$ ]] && kill $1 2>/dev/null || true;     
                        for i in ${ALL_CHILD[$1]:=}; do walk $i; done;    };   
                       walk $1;}; function nxf_mktemp() {;    local base=${1:-/
                     tmp};    if [[ $(uname) = Darwin ]]; then mktemp -d $base/
                     nxf.XXXXXXXXXX;    else TMPDIR="$base" mktemp -d -t nxf.XX
                     XXXXXXXX;    fi;}; on_exit() {;  exit_status=${ret:=$?};  
                     printf $exit_status > /mnt/beegfs/delhommet/needlestack_ca
                     llings/annotate_GATK_vcf/myeloma/work/1e/58592e36e990c0f0e
                     9a9d9ce845ecb/.exitcode;  set +u;  [[ "$COUT" ]] && rm -f 
                     "$COUT" || true;  [[ "$CERR" ]] && rm -f "$CERR" || true; 
                      exit $exit_status;}; on_term() {;    set +e;    [[ "$pid"
                      ]] && nxf_kill $pid;}; trap on_exit EXIT;trap on_term TER
                     M INT USR1 USR2; [[ $NXF_DEBUG > 0 ]] && nxf_env;touch /mn
                     t/beegfs/delhommet/needlestack_callings/annotate_GATK_vcf/
                     myeloma/work/1e/58592e36e990c0f0e9a9d9ce845ecb/.command.be
                     gin;[ -f /mnt/beegfs/delhommet/needlestack_callings/annota
                     te_GATK_vcf/myeloma/work/1e/58592e36e990c0f0e9a9d9ce845ecb
                     /.command.env ] && source /mnt/beegfs/delhommet/needlestac
                     k_callings/annotate_GATK_vcf/myeloma/work/1e/58592e36e990c
                     0f0e9a9d9ce845ecb/.command.env;rm -f split1.vcf.bgz;ln -s 
                     /mnt/beegfs/delhommet/needlestack_callings/annotate_GATK_v
                     cf/myeloma/work/c6/f7bd92383f66e2bdb538ca2ecd82c8/split1.v
                     cf.bgz split1.vcf.bgz;  set +e;COUT=$PWD/.command.po; mkfi
                     fo "$COUT";CERR=$PWD/.command.pe; mkfifo "$CERR";tee .comm
                     and.out < "$COUT" & tee1=$!;tee .command.err < "$CERR" >&2
                      & tee2=$!;(;/bin/bash /mnt/beegfs/delhommet/needlestack_c
                     allings/annotate_GATK_vcf/myeloma/work/1e/58592e36e990c0f0
                     e9a9d9ce845ecb/.command.run.1;) >"$COUT" 2>"$CERR" & pid=$
                     !;wait $pid || ret=$?;wait $tee1 $tee2>
Wed Aug 10 15:05:08: Submitted from host <hn>, to Queue <normal>, CWD </mnt/bee
                     gfs/delhommet/needlestack_callings/annotate_GATK_vcf/myelo
                     ma/work/1e/58592e36e990c0f0e9a9d9ce845ecb>, Output File </
                     mnt/beegfs/delhommet/needlestack_callings/annotate_GATK_vc
                     f/myeloma/work/1e/58592e36e990c0f0e9a9d9ce845ecb/.command.
                     log>, Requested Resources <rusage[mem=8192]>, Specified Ho
                     sts <cn01>, <cn02>, <cn03>, <cn04>, <cn05>, <cn06>, <cn07>
                     , <cn08>, <cn09>, <cn10>, <cn11>, <cn12>;

 MEMLIMIT
      8 G 
Wed Aug 10 15:10:26: Dispatched to <cn08>, Effective RES_REQ <select[(type == a
                     ny )] order[slots] rusage[mem=8192.00] span[hosts=1] >;
Wed Aug 10 15:10:26: Starting (Pid 6022);
Wed Aug 10 15:10:26: Running with execution home </home/delhommet>, Execution C
                     WD </mnt/beegfs/delhommet/needlestack_callings/annotate_GA
                     TK_vcf/myeloma/work/1e/58592e36e990c0f0e9a9d9ce845ecb>, Ex
                     ecution Pid <6022>;
Thu Aug 11 10:13:21: Done successfully. The CPU time used is 72536.9 seconds;
Thu Aug 11 10:13:22: Post job process done successfully;

MEMORY USAGE:
MAX MEM: 4.8 Gbytes;  AVG MEM: 3.9 Gbytes

Summary of time in seconds spent in various states by  Thu Aug 11 10:13:22
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  318      0        68575    0        0        0        68893
Paolo Di Tommaso
@pditommaso
Aug 11 2016 11:59
can you share the .nextflow.log file?
(not here because it's too big I guess..)
Paolo Di Tommaso
@pditommaso
Aug 11 2016 12:11
um, I was talking about the .nextflow.log, not the .command.log
Tiffany Delhomme
@tdelhomme
Aug 11 2016 12:12
oops ok
I have this in my config
process {
    executor = 'lsf'
    scratch = false
    errorStrategy = 'retry'
    maxRetries = 4
    maxErrors = 10000
    clusterOptions = '-m "cn01 cn02 cn03 cn04 cn05 cn06 cn07 cn08 cn09 cn10 cn11 cn12"'
    $samtools_mpileup {
        memory = { 4.GB * task.attempt }
    }
    $annotate_vcf {
        memory = { 4.GB * task.attempt }
    }
}
Paolo Di Tommaso
@pditommaso
Aug 11 2016 12:17
is your pipeline still running?
Tiffany Delhomme
@tdelhomme
Aug 11 2016 12:18
yes
Paolo Di Tommaso
@pditommaso
Aug 11 2016 13:26
@tdelhomme Something strange is happening. Could you remove the retry option and launch another run to see the error produced by nextflow for that job?
Tiffany Delhomme
@tdelhomme
Aug 11 2016 13:30

ok I will remove

errorStrategy = 'retry'
maxRetries = 4
maxErrors = 10000

and relaunch, but each job takes 20 hours...

Paolo Di Tommaso
@pditommaso
Aug 11 2016 13:32
I see, tomorrow then..
don't you have a smaller dataset just to try?
Tiffany Delhomme
@tdelhomme
Aug 11 2016 13:41
Yes, I have tried with part of my big output file and everything was fine... but there was no need to retry for this small file (enough memory at the first submission)
Paolo Di Tommaso
@pditommaso
Aug 11 2016 13:43
Ah wait, I was missing this. In some cases the retry is expected, because jobs don't have enough resources..
Tiffany Delhomme
@tdelhomme
Aug 11 2016 13:44
I did not get it :worried:
Paolo Di Tommaso
@pditommaso
Aug 11 2016 13:45
I mean you are expecting some jobs to be re-tried
because you increase the mem for them
    $annotate_vcf {
                      memory = { 4.GB * task.attempt }
                   }
however you have 200 Re-submitted process > annotate_vcf entries in your log file
there's clearly too little mem
I would suggest to increase the mem to at least 8.GB if not 16.GB
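something like this, for example (just a sketch; the 8 GB figure follows the suggestion above, adjust it to your data and your nodes):

process {
    executor = 'lsf'
    scratch = false
    errorStrategy = 'retry'
    maxRetries = 4

    $annotate_vcf {
        // start from 8 GB and grow by another 8 GB on each retry attempt
        memory = { 8.GB * task.attempt }
    }
}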
Tiffany Delhomme
@tdelhomme
Aug 11 2016 13:47
yes, all of my 200 jobs have been retried, but I do not understand why this causes a problem here...
I have increased it to 6GB, I think that's enough, but I still do not understand the problem: each job does not have enough memory the first time, then we increase it, and that should be OK, shouldn't it?
Paolo Di Tommaso
@pditommaso
Aug 11 2016 13:52
yes, but if it's too low, there's the risk that all of them will fail and be resubmitted
how much memory does the trace file produced by nextflow report?
Tiffany Delhomme
@tdelhomme
Aug 11 2016 13:56

5.3GB
In the trace I have this line for the process that was re-submitted but successfully completed:

2    1e/58592e    189683  annotate_vcf (1)        FAILED  0    2016-08-10 15:05:08.228 19h 8m 14s    19h 2m 51s    99.9%   4.9 GB  5.3 GB  155.9 MB        751.7 MB

status failed, strange?

Paolo Di Tommaso
@pditommaso
Aug 11 2016 13:57
a bit
In the log file there's this
Aug-10 15:05:07.165 [Thread-4] DEBUG nextflow.executor.GridTaskHandler - File is returning an empty content TaskHandler[jobId: 189436; id: 2; name: annotate_vcf (1); status: RUNNING; exit: -; workDir: /mnt/beegfs/delhommet/needlestack_callings/annotate_GATK_vcf/myeloma/work/3f/2982984c1d0ff47529d1957239f70d started: 1470812955279; exited: 2016-08-10T13:05:07Z; ] -- Try to wait a while and .. pray.

Aug-10 15:05:08.136 [pool-1-thread-2] DEBUG nextflow.executor.GridTaskHandler - Launching process > annotate_vcf (1) -- work folder: /mnt/beegfs/delhommet/needlestack_callings/annotate_GATK_vcf/myeloma/work/1e/58592e36e990c0f0e9a9d9ce845ecb
Aug-10 15:05:08.228 [pool-1-thread-2] INFO  nextflow.Session - [1e/58592e] Re-submitted process > annotate_vcf (1)
It seems that when it tries to read the exit file it returns an empty content
Tiffany Delhomme
@tdelhomme
Aug 11 2016 14:08
yes, but this re-submission is OK because it needed more memory (this was the first re-submission)
the second re-submission, which is not explained, is here: https://github.com/tdelhomme/data_tests/blob/master/.nextflow.log#L1184
Paolo Di Tommaso
@pditommaso
Aug 11 2016 14:09
actually no, it was also launched at 9:09
Aug-10 09:09:13.455 [Actor Thread 5] DEBUG nextflow.executor.GridTaskHandler - Launching process > annotate_vcf (1) -- work folder: /mnt/beegfs/delhommet/needlestack_callings/annotate_GATK_vcf/myeloma/work/3f/2982984c1d0ff47529d1957239f70d
Aug-10 09:09:13.535 [Actor Thread 5] INFO  nextflow.Session - [3f/298298] Submitted process > annotate_vcf (1)
can you look in the trace file for the line with 3f/298298?
Tiffany Delhomme
@tdelhomme
Aug 11 2016 14:12
Yes: at 9:09 the first submission, with 4GB memory; at 15:05 with 8GB (successfully completed); and at 10:13 the next day it was re-submitted
Yes I will check that
2    3f/298298    189436    annotate_vcf (1)    FAILED    130    2016-08-10 09:09:13.534    5h 55m 55s    5h 55m 53s    97.2%    4.1 GB    4.5 GB    106.1 MB    144.1 MB
Paolo Di Tommaso
@pditommaso
Aug 11 2016 14:13
ok, this reports 130 correctly
Tiffany Delhomme
@tdelhomme
Aug 11 2016 14:13
Yes!
Paolo Di Tommaso
@pditommaso
Aug 11 2016 14:48
@tdelhomme In the folder /mnt/beegfs/delhommet/needlestack_callings/annotate_GATK_vcf/myeloma/work/1e/58592e36e990c0f0e9a9d9ce845ecb, are the *.pdf and *.vcf files there?
Paolo Di Tommaso
@pditommaso
Aug 11 2016 14:56
In my opinion what is happening is that the job completes but nextflow doesn't find the expected file due to a latency in the file system, thus it reports an error and re-executes the task
I'm thinking this because there are multiple entries like File is returning an empty content in the log file
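(as a side note, and this is an assumption to check against the docs for the Nextflow version you are running, the time the grid executor keeps polling for the .exitcode file before giving up can reportedly be tuned with the executor exitReadTimeout setting, roughly like:

executor {
    // how long Nextflow keeps polling for the .exitcode file before
    // marking the task as failed; assumed setting, verify it exists
    // in your version before relying on it
    exitReadTimeout = '300 sec'
}
)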
Tiffany Delhomme
@tdelhomme
Aug 11 2016 14:58
yes, you're right, I did not have any PDF (I have an option --no_plots)...
Paolo Di Tommaso
@pditommaso
Aug 11 2016 14:58
!!!!!
Tiffany Delhomme
@tdelhomme
Aug 11 2016 14:59
:+1: like here, do I have to create a pdf if it is not created by my process?
https://github.com/IARCbioinfo/needlestack/blob/dev/needlestack.nf#L393
Paolo Di Tommaso
@pditommaso
Aug 11 2016 15:00
you can do that ..
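for example, something along these lines in the process (a rough sketch, not the actual needlestack code; the command and channel names are made up):

process annotate_vcf {
    output:
    file 'annotated.vcf' into annotated_vcf_ch
    file 'plots.pdf' into plots_ch

    script:
    """
    annotate.sh > annotated.vcf    # hypothetical command
    # create an empty placeholder so the declared PDF output always
    # exists, even when plotting is disabled with --no_plots
    [ -f plots.pdf ] || touch plots.pdf
    """
}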
I will add a message explaining why the job was re-executed ;)
Tiffany Delhomme
@tdelhomme
Aug 11 2016 15:01
but when expected files are not present, nextflow returns an explicit message in the log, doesn't it??
Paolo Di Tommaso
@pditommaso
Aug 11 2016 15:01
not when using retry :(
Tiffany Delhomme
@tdelhomme
Aug 11 2016 15:03
hum but why?!
Paolo Di Tommaso
@pditommaso
Aug 11 2016 15:03
what ?
Tiffany Delhomme
@tdelhomme
Aug 11 2016 15:03
why no error message with retry?
Paolo Di Tommaso
@pditommaso
Aug 11 2016 15:05
it's only there when debugging is enabled, to avoid too much noise in the log file
(leaving)
Tiffany Delhomme
@tdelhomme
Aug 11 2016 15:07
ok, thanks a lot! I will relaunch correctly :smile:
Paolo Di Tommaso
@pditommaso
Aug 11 2016 15:07
welcome
Tiffany Delhomme
@tdelhomme
Aug 11 2016 15:09
:smile: :+1:
Mike Smoot
@mes5k
Aug 11 2016 21:09
I'm wondering if anyone has experienced this docker problem with nextflow docker/docker#19296 and if you've possibly found a workaround. The tl;dr summary is that docker will occasionally fail to start a container, leaving it in a dead state, if you try to launch too many containers concurrently. With many short processes in nextflow, this has been happening for us.
Paolo Di Tommaso
@pditommaso
Aug 11 2016 23:58
@mes5k Actually, I've never seen that. Have you tried using a different storage driver?
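Another thing that might mitigate it (just a sketch of a possible workaround, not a fix for the docker issue itself) is throttling how many tasks, and therefore containers, run at the same time, e.g. with maxForks in the config:

process {
    // limit concurrent tasks per process so fewer containers are
    // started at the same time; the value here is only an example
    maxForks = 10
}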