These are chat archives for nextflow-io/nextflow

15th
Feb 2016
wuyilei
@wuyilei
Feb 15 2016 03:07

hello, guys. We have run a WGS pipeline through Nextflow on a Grid Engine cluster. A batch of jobs was submitted. Everything is fine and most jobs finish successfully, but a few jobs hang in the running state without any error, even though the job has actually finished. When we log in to the compute node and dig into more detail, we find the job process is still there, hung at tee .command.out. The process tree is shown below:

```
 3079 ?        Sl    71:10 /usr/bin/sge_execd
25652 ?        S      0:00  \_ sge_shepherd-43799 -bg
25653 ?        Ss     0:00      \_ /bin/bash /var/spool/gridengine/default/DN-03/job_scripts/43799
25665 ?        S      0:00          \_ tee .command.out
```

any idea how this happens?
wuyilei
@wuyilei
Feb 15 2016 03:15
btw: this issue can be reproduced in our cluster
Paolo Di Tommaso
@pditommaso
Feb 15 2016 09:11
@wuyilei tee is used to save the program output while still presenting it on the original stdout
do you know that job's working directory ?
what is the content of the .exitcode file ?
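As background, tee's role can be sketched with a stand-in command (the file name `cmd.out` here is illustrative, not the actual Nextflow wrapper file):

```shell
# tee copies its stdin both to the named file AND to its own stdout,
# which is how the wrapper keeps a log while preserving the stream
echo "task output" | tee cmd.out
cat cmd.out
```

Note that tee only exits when it reaches EOF on its stdin, which matters for the hang discussed here.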
wuyilei
@wuyilei
Feb 15 2016 09:16
working directory is /glusterfs/home/wuyl/test/wgs_tuning/pipeline_wgs/work/39/e4a6b5527058b2e6f914f9d55b2c35/
Paolo Di Tommaso
@pditommaso
Feb 15 2016 09:17
ok, in that folder there should be a .exitcode file. What is its content?
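A minimal sketch of the mechanism, assuming the wrapper works roughly like this (`run_task` is a hypothetical stand-in for the real task script, not Nextflow's actual code):

```shell
# the wrapper runs the task in a child, then records its exit status;
# if the scheduler kills the job hard, the last step never runs,
# which is why no .exitcode file appears in the work directory
run_task() { exit 3; }   # hypothetical stand-in for the task
( run_task )
echo $? > .exitcode      # never reached on a hard kill
cat .exitcode            # 3
```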
wuyilei
@wuyilei
Feb 15 2016 09:17
there is no .exitcode file in this directory yet
Paolo Di Tommaso
@pditommaso
Feb 15 2016 09:17
um, so it has been killed
can you list the content of that folder with ls -la ?
wuyilei
@wuyilei
Feb 15 2016 09:25
```
drwxr-xr-x 2 wuyl bioinfo 8287 Feb 15 03:01 .
drwxr-xr-x 4 wuyl bioinfo 240 Feb 15 13:53 ..
-rw-r--r-- 1 wuyl bioinfo 0 Feb 15 02:54 .command.begin
-rw-r--r-- 1 wuyl bioinfo 1803 Feb 15 02:54 .command.env
-rw-r--r-- 1 wuyl bioinfo 184932 Feb 15 03:02 .command.err
-rw-r--r-- 1 wuyl bioinfo 419411 Feb 15 03:02 .command.log
-rw-r--r-- 1 wuyl bioinfo 234479 Feb 15 03:02 .command.out
prw-r--r-- 1 wuyl bioinfo 0 Feb 15 02:54 .command.pe
prw-r--r-- 1 wuyl bioinfo 0 Feb 15 02:54 .command.po
-rw-r--r-- 1 wuyl bioinfo 2993 Feb 15 02:54 .command.run
-rw-r--r-- 1 wuyl bioinfo 2483 Feb 15 02:54 .command.run.1
-rw-r--r-- 1 wuyl bioinfo 458 Feb 15 02:54 .command.sh
-rw-r--r-- 1 wuyl bioinfo 192 Feb 15 03:02 .command.trace
lrwxrwxrwx 1 wuyl bioinfo 136 Feb 15 02:54 sample_sorted_dedup.bam -> /glusterfs/home/wuyl/test/wgs_tuning/pipeline_wgs/work/48/bfae21968971fbfb99d5f81138d140/sample_sorted_dedup.bam
-rw-r--r-- 1 wuyl bioinfo 31094 Feb 15 03:02 sample_sorted_dedup.bam-chr13.predSV.txt
lrwxrwxrwx 1 wuyl bioinfo 140 Feb 15 02:54 sample_sorted_dedup.bam.bai -> /glusterfs/home/wuyl/test/wgs_tuning/pipeline_wgs/work/48/bfae21968971fbfb99d5f81138d140/sample_sorted_dedup.bam.bai
lrwxrwxrwx 1 wuyl bioinfo 157 Feb 15 02:54 sample_sorted_dedup.bam.chr13.cover_filtered -> /glusterfs/home/wuyl/test/wgs_tuning/pipeline_wgs/work/48/bfae21968971fbfb99d5f81138d140/sample_sorted_dedup.bam.chr13.cover_filtered
lrwxrwxrwx 1 wuyl bioinfo 83 Feb 15 02:54 database -> /glusterfs/home/wuyl/test/wgs_tuning/pipeline_wgs/data/database
```
Paolo Di Tommaso
@pditommaso
Feb 15 2016 09:27
the fact that there's no .exitcode file means that the job has been killed hard by the batch scheduler
this could explain why that process was hung
do you have the SGE accounting installed in your system ?
the command qacct -j 43799 should report the job exit status and the cause of the error
wuyilei
@wuyilei
Feb 15 2016 09:31
qstat shows that the job is still in running status. The output of qacct -j 43799 is:

```
error: job id 43799 not found
```
Paolo Di Tommaso
@pditommaso
Feb 15 2016 09:32
ah
what's the content of .command.sh ?
Paolo Di Tommaso
@pditommaso
Feb 15 2016 09:48
Yes, it may be that the script in your command is not closing stdout correctly
I would try to redirect the stdout of your command to a file
something like:

```
script.pl ... > out.txt
```
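Why the redirection can help may be sketched with stand-in commands (the `sleep`/`timeout` calls here are illustrative assumptions, not the user's script): a child process that inherits stdout keeps the pipe to tee open, so tee never sees EOF, while redirecting the child's output closes that inherited descriptor.

```shell
# a background child inheriting stdout holds the pipe open,
# so tee hangs until timeout kills it after 2s
( echo "parent done"; sleep 5 & ) | timeout 2 tee hang.out
rc1=$?   # 124: tee was killed by timeout

# redirecting the child's output closes the inherited descriptor,
# so tee sees EOF as soon as the parent exits
( echo "parent done"; sleep 5 >/dev/null 2>&1 & ) | timeout 2 tee ok.out
rc2=$?   # 0: tee exited normally
echo "rc1=$rc1 rc2=$rc2"
```

The `timeout` here only bounds the demonstration; in the real job nothing kills tee, which is why the task appears to run forever.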
wuyilei
@wuyilei
Feb 15 2016 09:50
we will try that later, thx a lot!
Paolo Di Tommaso
@pditommaso
Feb 15 2016 09:51
you can do that by killing the job
modifying the content of .command.sh in that way
and then re-submitting it with qsub .command.run
hope this helps
wuyilei
@wuyilei
Feb 15 2016 09:52
thx. we will try. hope it's the cause.
we need to rerun the entire job, because although the issue can be reproduced, it doesn't happen every time
Paolo Di Tommaso
@pditommaso
Feb 15 2016 09:54
I see