These are chat archives for nextflow-io/nextflow

27th Jul 2017
Simone Baffelli
@baffelli
Jul 27 2017 07:50

@sergpolly

5) Is it possible to use 2 publishDir statements per process, to make some files go to one folder, and others into a different one?:

    publishDir path: getIntermediateDir('pairsam_run'), pattern: "*.pairsam.gz" 
    publishDir path: getOutDir('stats_run'), pattern: "*.stats", mode:"copy"

I use a closure to achieve that. It is a bit annoying to write but it works:

    publishDir path:"${params.results}/", 
    saveAs: {
      fn -> 
      switch(fn)
        {
          case "ifgram": return "${params.ifgram_dir}/${format_pair_name(master_id,slave_id)}.int"
          case "off_par": return "${params.ifgram_dir}/${format_pair_name(master_id,slave_id)}.off_par"
          case "master_mli": return "${params.mli_dir}/${master_id}.mli"
          case "slave_mli":  return "${params.mli_dir}/${slave_id}.mli"
          case "master_mli_par": return "${params.mli_dir}/${master_id}.mli.par"
          case "slave_mli_par":  return "${params.mli_dir}/${slave_id}.mli.par"
          case "ifgram.bmp": return "${params.ifgram_dir}/${format_pair_name(master_id,slave_id)}.bmp"
        }
       }
Paolo Di Tommaso
@pditommaso
Jul 27 2017 08:57
OH!
nice trick! At this point, to make the code more readable, I would create a helper method and pass it as a reference, eg:
def foo(fn) {
    switch( fn ) {
        case "ifgram":         return "${params.ifgram_dir}/${format_pair_name(master_id,slave_id)}.int"
        case "off_par":        return "${params.ifgram_dir}/${format_pair_name(master_id,slave_id)}.off_par"
        case "master_mli":     return "${params.mli_dir}/${master_id}.mli"
        case "slave_mli":      return "${params.mli_dir}/${slave_id}.mli"
        case "master_mli_par": return "${params.mli_dir}/${master_id}.mli.par"
        case "slave_mli_par":  return "${params.mli_dir}/${slave_id}.mli.par"
        case "ifgram.bmp":     return "${params.ifgram_dir}/${format_pair_name(master_id,slave_id)}.bmp"
    }
}
then
publishDir path:"${params.results}/", saveAs: this.&foo
Maxime Garcia
@MaxUlysse
Jul 27 2017 09:13
Ohhh pretty neat
I'm stealing that
Paolo Di Tommaso
@pditommaso
Jul 27 2017 09:15
like Picasso said: Good programmers copy; great programmers steal :sunglasses:
Phil Ewels
@ewels
Jul 27 2017 10:49
@MaxUlysse - we already use this in most of our NGI- pipelines :laughing:
Maxime Garcia
@MaxUlysse
Jul 27 2017 11:02
Did not see that before
I do need to look into all the NGI pipelines more
Paolo Di Tommaso
@pditommaso
Jul 27 2017 11:04
You need to update the book
:satisfied:
Phil Ewels
@ewels
Jul 27 2017 11:10
hah, yup! Or perhaps our (more boring) book which we started but never really got into..
Paolo Di Tommaso
@pditommaso
Jul 27 2017 11:11
don't tell me how boring it is to write documentation :grimacing:
Simone Baffelli
@baffelli
Jul 27 2017 11:38
@pditommaso Indeed, I never made that into a method because that was a quick hack made at a conference while improving my slides
Actually, one could also change it slightly to make the method dispatch the paths based on a pattern ;)
Paolo Di Tommaso
@pditommaso
Jul 27 2017 11:41
:+1:
Phil Ewels
@ewels
Jul 27 2017 11:43
Yup - we use lots of fn.indexOf but it still ends up being fairly verbose
Worth mentioning for those new to this technique (eg @MaxUlysse) that you can also return null and the file won't be saved. Useful if you have a param controlling whether to save certain files (eg. intermediate alignments etc)
See example
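Something like this (just a sketch; params.outdir and the params.saveIntermediates flag are made-up names):

    publishDir path: "${params.outdir}/alignments",
        saveAs: { fn ->
            // returning null skips publishing; a hypothetical flag controls intermediates
            ( fn.endsWith('.bam') && !params.saveIntermediates ) ? null : fn
        }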
Simone Baffelli
@baffelli
Jul 27 2017 11:47
I'll try to make a function that takes a dictionary of pattern:name and produces the path... that would be useful and less verbose
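Roughly something like this (just a sketch; the helper name and the rule map are made up):

    // hypothetical helper: the first matching regex decides the target directory
    def dispatch(Map rules, String fn) {
        def hit = rules.find { pattern, dir -> fn ==~ pattern }
        hit ? "${hit.value}/${fn}" : null   // null -> file is not published
    }

    publishDir path: "${params.results}/",
        saveAs: { fn -> dispatch([ (/.*\.mli(\.par)?/): params.mli_dir,
                                   (/.*\.(int|bmp)/)  : params.ifgram_dir ], fn) }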
Simone Baffelli
@baffelli
Jul 27 2017 11:58
I'm trying to start a julia kernel only once per pipeline run and pass the kernel's PID to the processes so they can connect to it and run certain scripts, because directly calling julia for each process instance is very costly due to a huge startup overhead. Is there a best practice for such things? Some sort of beforeScript that is run at the start of the pipeline and not of each process?
Of course I could just use a process with no inputs, but I'm not sure that's the most elegant approach
Simone Baffelli
@baffelli
Jul 27 2017 12:21
:confused: I'm using too many languages :scream:
Paolo Di Tommaso
@pditommaso
Jul 27 2017 12:23
Some sort of beforeScript that is run at the start of the pipeline and not of each process?
@baffelli I was thinking about something similar but it's not available at this time
but why do you need to launch a daemon?
and not just use the Julia interpreter instead?
Simone Baffelli
@baffelli
Jul 27 2017 12:31
Because I need certain packages that take a very long time to start
unlike python, julia can be rather slow at startup
Paolo Di Tommaso
@pditommaso
Jul 27 2017 12:31
umm, I see
but this is supposed to run in a cluster scheduler ?
Simone Baffelli
@baffelli
Jul 27 2017 12:32
sadly not :worried:
locally on my machine
I've got a time series of 18500 images to process
Paolo Di Tommaso
@pditommaso
Jul 27 2017 12:32
ah, yes your boss doesn't agree :)
well, just use a bash wrapper to launch the kernel and then the pipeline
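something along these lines (just a sketch; the kernel script and the --kernel_pid param are made up):

    #!/bin/bash
    # start the Julia kernel once, hand its PID to the pipeline, clean up at the end
    julia start_kernel.jl &
    KERNEL_PID=$!
    nextflow run main.nf --kernel_pid "$KERNEL_PID"
    kill "$KERNEL_PID"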
Simone Baffelli
@baffelli
Jul 27 2017 12:33
that's a good idea... it'll be a struggle anyway because julia does not really use a daemon-based approach
I will have to heavily edit my scripts
Sergey Venev
@sergpolly
Jul 27 2017 14:51
Thank you @baffelli ! I'll steal the idea for future modifications of the pipeline.
But meanwhile, I'd really love to figure out why the pipeline keeps crashing on our LSF cluster
I have some interesting logs @pditommaso , https://pastebin.com/Q2ZhmdmZ
I could explain briefly what happened
[64/911b13] Submitted process > merge_pairsam_into_runs (library:HeLa1 run:lane2)
[d7/0d451f] Submitted process > merge_pairsam_into_runs (library:HeLa1 run:lane4)
[08/9780f5] Submitted process > merge_pairsam_into_runs (library:HeLa1 run:lane1)
[37/5bc7a3] Submitted process > merge_pairsam_into_runs (library:HeLa1 run:lane3)
[10/e29b2b] Submitted process > merge_pairsam_into_runs (library:HeLa1 run:lane5)
WARN: Process `merge_pairsam_into_runs (library:HeLa1 run:lane4)` terminated with an error exit status (140) -- Execution is retried (1)
[46/6b1cd3] Re-submitted process > merge_pairsam_into_runs (library:HeLa1 run:lane4)
WARN: Process `merge_pairsam_into_runs (library:HeLa1 run:lane4)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (2)
[55/97542f] Re-submitted process > merge_pairsam_into_runs (library:HeLa1 run:lane4)
WARN: Process `merge_pairsam_into_runs (library:HeLa1 run:lane1)` terminated with an error exit status (140) -- Execution is retried (1)
[c1/63091a] Re-submitted process > merge_pairsam_into_runs (library:HeLa1 run:lane1)
WARN: Process `merge_pairsam_into_runs (library:HeLa1 run:lane5)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (1)
[5b/ba7900] Re-submitted process > merge_pairsam_into_runs (library:HeLa1 run:lane5)
Paolo Di Tommaso
@pditommaso
Jul 27 2017 14:54
140 generally means the cluster killed the job because it's using more resources than requested
Sergey Venev
@sergpolly
Jul 27 2017 14:54
This one I consider normal behavior
But terminated for an unknown reason
is more tricky
It has been the same scenario for me for the past 3 days, and it has been more or less consistent
Paolo Di Tommaso
@pditommaso
Jul 27 2017 14:56
let me check
Sergey Venev
@sergpolly
Jul 27 2017 14:57
What happens is that I submit a pipeline which has a heavy step-process that is first attempted in a short queue (under 4 hours), and then nextflow retries it in a long queue (~12 hours)
As you can see, 3 out of 5 were 140-terminated and re-submitted
then nextflow thinks, for some reason, that the job (library:HeLa1 run:lane4) terminated for an unknown reason, which is not the case - it's still running in the LSF-cluster
And usually everything stops at this point (the last 2 times), because I allow 2 retries for this job, but today nextflow decided to re-submit it one more time(!)
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:01
look here
Jul-27 06:36:17.388 [Task monitor] DEBUG nextflow.file.FileHelper - NFS path (true): /farline/umw_job_dekker/HPCC/sv49w/distiller-nf/work
Jul-27 06:40:42.441 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exist status for process TaskHandler[jobId: 4709252; id: 152; name: merge_pairsam_into_runs (library:HeLa1 run:lane4); status: RUNNING; exit: -; error: -; workDir: /farline/umw_job_dekker/HPCC/sv49w/distiller-nf/work/46/6b1cd3647d126694270092f4bc7a73 started: 1501151622383; exited: -; ] -- exitStatusReadTimeoutMillis: 270000; delta: 270037
Current queue status:
>   job: 4709242: RUNNING
>   job: 4709240: RUNNING

Content of workDir: /farline/umw_job_dekker/HPCC/sv49w/distiller-nf/work/46/6b1cd3647d126694270092f4bc7a73
> total 133
> drwxrwxr-x 2 sv49w umw_job_dekker  183 Jul 27 06:33 .
> drwxrwxr-x 3 sv49w umw_job_dekker   48 Jul 27 06:32 ..
> -rw-rw-r-- 1 sv49w umw_job_dekker    0 Jul 27 06:33 .command.begin
> -rw-rw-r-- 1 sv49w umw_job_dekker   21 Jul 27 06:32 .command.env
> -rw-rw-r-- 1 sv49w umw_job_dekker   43 Jul 27 06:33 .command.log
> -rw-rw-r-- 1 sv49w umw_job_dekker 3486 Jul 27 06:32 .command.run
> -rw-rw-r-- 1 sv49w umw_job_dekker 2672 Jul 27 06:32 .command.run.1
> -rw-rw-r-- 1 sv49w umw_job_dekker  232 Jul 27 06:32 .command.sh

Jul-27 06:40:42.442 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 4709252; id: 152; name: merge_pairsam_into_runs (library:HeLa1 run:lane4); status: COMPLETED; exit: -; error: -; workDir: /farline/umw_job_dekker/HPCC/sv49w/distiller-nf/work/46/6b1cd3647d126694270092f4bc7a73 started: 1501151622383; exited: -; ]
Jul-27 06:40:42.443 [Task monitor] WARN  nextflow.processor.TaskProcessor - Process `merge_pairsam_into_runs (library:HeLa1 run:lane4)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (2)
Jul-27 06:40:42.564 [Task submitter] DEBUG nextflow.executor.GridTaskHandler - Submitted process merge_pairsam_into_runs (library:HeLa1 run:lane4) > lsf jobId: 4709253; workDir: /farline/umw_job_dekker/HPCC/sv49w/distiller-nf/work/55/97542f59ff65877cb6ef464e4c8975
Jul-27 06:40:42.564 [Task submitter] INFO  nextflow.Session - [55/97542f] Re-submitted process > merge_pairsam_into_runs (library:HeLa1 run:lane4)
Sergey Venev
@sergpolly
Jul 27 2017 15:01
And now I have 2 instances of the same process running in the LSF-cluster
sv49w@very_big_computer:/farline/umw_job_dekker/HPCC/sv49w/distiller-nf$ bjobs
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
4709260    sv49w   RUN   long       c06b08      16*c34b08   *un_lane5) Jul 27 07:02
4709259    sv49w   RUN   long       c06b08      16*c07b01   *un_lane1) Jul 27 06:54
4709253    sv49w   RUN   long       c06b08      16*c04b05   *un_lane4) Jul 27 06:40
4706088    sv49w   RUN   long       ghpcc06     2*c06b08    *e cluster Jul 26 18:05
4709252    sv49w   RUN   long       c06b08      16*c26b08   *un_lane4) Jul 27 06:32
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:02
the job 4709252 looks completed, it's not reported any more in the queue status
there are only
>   job: 4709242: RUNNING
>   job: 4709240: RUNNING
hence NF tries to read the .exitcode file, but it's not available in the task directory
Sergey Venev
@sergpolly
Jul 27 2017 15:04
So, 4709252 went to zombie-mode
or whatever it's called
But you can see it's still running
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:04
umm
Sergey Venev
@sergpolly
Jul 27 2017 15:04
in the bjobs output
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:04
weird ..
Sergey Venev
@sergpolly
Jul 27 2017 15:04
crazy!
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:04
wait
Sergey Venev
@sergpolly
Jul 27 2017 15:05
it's an LSF-nextflow communication issue - of some kind
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:05
are you using any slurm partition (ie queue) for this execution?
Sergey Venev
@sergpolly
Jul 27 2017 15:06
slurm partition?
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:06
ok no :)
Sergey Venev
@sergpolly
Jul 27 2017 15:06
it's LSF
our cluster is weird, I have to admit
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:06
I was assuming it was slurm, ok
are you specifying any queue ?
Sergey Venev
@sergpolly
Jul 27 2017 15:07
our cluster has weird issues from time to time
Yes - it's short or long
What happens is that I submit a pipeline which has a heavy step-process that is first attempted in a short queue (under 4 hours), and then nextflow retries it in a long queue (~12 hours)
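In the config that looks roughly like this (a sketch; the real directives in the pipeline may differ):

    process merge_pairsam_into_runs {
        queue { task.attempt == 1 ? 'short' : 'long' }  // short first, long on retry
        errorStrategy 'retry'
        maxRetries 2

        script:
        """
        run_merge.sh   # hypothetical merge command
        """
    }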
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:08
can you try this command
bjobs -o 'JOBID STAT SUBMIT_TIME delimiter=\',\'' -noheader
no wait
this should work
Sergey Venev
@sergpolly
Jul 27 2017 15:09
I can run any command to get this resolved
Something wrong with the quotes
these are normal single quotes?
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:11
umm
bjobs -o "JOBID STAT SUBMIT_TIME delimiter=','" -noheader
what about this?
Sergey Venev
@sergpolly
Jul 27 2017 15:12
4709260,RUN,Jul 27 07:02
4709259,RUN,Jul 27 06:54
4709253,RUN,Jul 27 06:40
4706088,RUN,Jul 26 18:05
4709252,RUN,Jul 27 06:32
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:12
so 4709252 is there ...
Sergey Venev
@sergpolly
Jul 27 2017 15:13
yep!
Is there a way to make nextflow ask about the process status several times?
Like, assuming cluster is crazy and is not responsive sometimes?
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:14
it does that already; I'm wondering if there's something wrong in the bjobs parsing
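btw the grace period NF waits for the .exitcode file (the exitStatusReadTimeoutMillis: 270000 in your log) should be tunable in the config, something like:

    // nextflow.config -- give a slow shared filesystem more time to expose .exitcode
    executor {
        exitReadTimeout = '10 min'   // default is 270 sec
    }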
Sergey Venev
@sergpolly
Jul 27 2017 15:15
This has been consistent for 3 times now
3 pipeline submissions
a job gets resubmitted to the long queue after a retry, and nextflow thinks it terminated right after
Can you point me to nextflow's LSF parser?
I could try to figure that out
if it does not look too cryptic
Sergey Venev
@sergpolly
Jul 27 2017 15:19
and it is parsing the output of this: ['bjobs', '-o', 'JOBID STAT SUBMIT_TIME delimiter=\',\'', '-noheader']
?
of the same thing that you gave me, basically?
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:20
yes
Simone Baffelli
@baffelli
Jul 27 2017 15:20
@sergpolly You are welcome. Always happy to help
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:21
I need to understand what exactly the output of bjobs is
Sergey Venev
@sergpolly
Jul 27 2017 15:21
Do you treat 'UNKWN' and 'ZOMBI' as errors?
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:21
that's not the problem
we need to activate this trace line
to do that, you will need to re-launch the pipeline as shown below
nextflow -trace nextflow.executor.AbstractGridExecutor run .. etc
Sergey Venev
@sergpolly
Jul 27 2017 15:23
Need more details, before I screw up everything
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:24
what do you mean ? :grin:
Sergey Venev
@sergpolly
Jul 27 2017 15:24
question 1) do I have to keep prefixing NXF_VER=...25.3 to make sure it's the new version?
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:24
yes
Sergey Venev
@sergpolly
Jul 27 2017 15:25
Ok!
"NXF_VER=0.25.3-SNAPSHOT nextflow -trace nextflow.executor.AbstractGridExecutor run distiller.nf -params-file project.yml -profile cluster"
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:25
:ok_hand:
Sergey Venev
@sergpolly
Jul 27 2017 15:25
Ok
Simone Baffelli
@baffelli
Jul 27 2017 15:25
you could probably set it in the configuration file, right?
env.NXF_VER=...
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:26
nope
NF variables are not read from the config file
Simone Baffelli
@baffelli
Jul 27 2017 15:26
:fearful:
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:26
it's too late to be applied!
Sergey Venev
@sergpolly
Jul 27 2017 15:26
only thing is, it's a long pipeline and to reproduce the error we might want to resume it
from some closer point
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:27
I guess it's fine
Sergey Venev
@sergpolly
Jul 27 2017 15:27
I'll try to think about how to make that work
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:28
at the end, as soon as there's a job failing with that problem, you can stop the execution and share the log file
Sergey Venev
@sergpolly
Jul 27 2017 15:29
Ok,
I think I'll just relaunch the whole thing
then
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:29
ok
Sergey Venev
@sergpolly
Jul 27 2017 15:30
then sometime tomorrow, I might have the result
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:30
ok
Sergey Venev
@sergpolly
Jul 27 2017 15:30
great! Thank you, again
Paolo Di Tommaso
@pditommaso
Jul 27 2017 15:30
welcome!