These are chat archives for nextflow-io/nextflow

24th Jan 2019
rfenouil
@rfenouil
Jan 24 08:47
@stevekm Thank you for the suggestion, I'll give it a try
rfenouil
@rfenouil
Jan 24 08:53
I started to think about a centralized local cache database for my institute.
When most pipelines share similar pre-processing steps, it could be a huge gain in time and disk space...
If we start using Nextflow libraries (reusable bricks) to build workflows, I think that could make sense. Any opinions on that?
Shellfishgene
@Shellfishgene
Jan 24 10:07
How can I find out why Nextflow does not resume a run? I have a large nf-core RNAseq run which I have resumed after some errors, and every time Nextflow used the cached results. Now it stopped again (drive full), but upon restarting with the exact same command line (with -resume) it started to resubmit all jobs from the beginning.
Hugues Fontenelle
@huguesfontenelle
Jan 24 12:14

Hello,
I've got some kind of bug. I'm suspecting that this has to do with my cluster, and not really nextflow. @pditommaso what do you think? Thanks

Jan-23 19:35:09.326 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exist status for process TaskHandler[jobId: 3198179; id: 2; name: mapping (2); status: RUNNING; exit: -; error: -; workDir: /net/p01-c2io-nfs/projects/p123/0I/swings-pipeline/work/64/a3da68588cd06bd4381d41249fae29; started: 1548268084386; exited: -; ] exitStatusReadTimeoutMillis: 270000; delta: 270021

The process is failing. All files in that process are produced properly, there is nothing in .command.err, but indeed I don't see any .exitcode!

Hugues Fontenelle
@huguesfontenelle
Jan 24 13:08
SLURM cluster, Singularity containers
Hugues Fontenelle
@huguesfontenelle
Jan 24 13:14

@Shellfishgene says:

I have a suspicion this may be related to file system slowness on our cluster,

I used to "solve" this with a sleep 60 at the end of some processes.
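i.e. roughly like this (untested sketch, process and tool names made up):

    process mapping {
        input:
        file reads from reads_ch

        output:
        file 'out.bam' into bam_ch

        script:
        """
        bwa mem ref.fa ${reads} > out.bam
        sleep 60   # crude workaround: give a slow shared filesystem time to catch up before the job exits
        """
    }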

Shellfishgene
@Shellfishgene
Jan 24 13:17
@huguesfontenelle That's worth a try, thanks.
Stephen Kelly
@stevekm
Jan 24 16:24

How can I find out why nextflow does not resume a run?

There are a lot of possible reasons, but I was wondering this for some of my pipeline steps recently, until I realized that I was printing the Nextflow session ID into an output file (it was embedded as an argument in my process scripts). That caused every 'resume' of the workflow to have a different output at that step, which then causes downstream steps to re-run.

So check that there is not something embedded within your process script section that changes every time you re-run the workflow.
This could also apply to other parts of the process as well; for example, I updated some pipeline steps to stage an extra input file, and that also caused all those steps to re-run even though the actual script section was unchanged.
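For instance, a made-up sketch of the kind of thing to look for; here workflow.start differs on every run, so this task's hash (and everything downstream of it) can never be reused from the cache:

    process report {
        input:
        file counts from counts_ch

        output:
        file 'report.txt' into report_ch

        script:
        // workflow.start changes on each (re)run, so the resolved script
        // text changes too and the cached result is never reused
        """
        echo "run started: ${workflow.start}" > report.txt
        cat ${counts} >> report.txt
        """
    }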
Stephen Kelly
@stevekm
Jan 24 16:29
@huguesfontenelle you should look up the SLURM records for that job, sacct -j 3198179
I have been having huge amounts of issues with our new SLURM cluster losing and breaking jobs
and cross-reference with the work directory, look closely at the timestamps and logs to see if you can pinpoint when exactly things stopped working. Usually SLURM leaves an error message when it kills your job, but it could also be possible that the job was killed silently or something
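e.g. something like (assuming a standard sacct setup; the format fields below are the usual ones):

    sacct -j 3198179 --format=JobID,JobName,State,ExitCode,Start,End,Elapsed

then compare Start/End against the timestamps on .command.run, .command.log and the missing .exitcode in the task's work directory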
Stephen Kelly
@stevekm
Jan 24 16:37

@rfenouil

I started to think about a centralized local cache database for my institute.
When most pipelines share similar pre-processing steps, it could be a huge gain in time and disk space...
If we start using nextflow libraries (reusable bricks) to build workflows, I think that could make sense. Any opinion on that ?

I think it depends a little on how your pipelines are organized and run. My group's repo is here: https://github.com/NYU-Molecular-Pathology ; our main pipelines are the demux-nf and NGS580-nf ones. Since I am working on NGS sequencing runs with multiple samples each, I treat each run as a separate project, so there are dedicated directories for each run's Demultiplexing and for its NGS580 analysis. Each one has its own copy of the Nextflow pipeline at those repos. I find this kind of setup to be advantageous because it keeps each pipeline instance isolated, allowing me to work on developing and debugging issues in one pipeline instance without affecting the others. It also keeps the results of the analysis tightly associated with all the exact code & configs used to produce it. The downside is that later, when you want to do customized analyses which aggregate samples from multiple runs, you end up having to make a new pipeline instance which then forgoes all the caching from the previous run results. But in general I think that having a single pipeline installation instance that you use for all your samples across disparate projects is asking for a messy situation.

In regards to having shared pipeline steps across multiple pipelines, I am still a strong advocate for simply copying & pasting the code & configs across the pipelines. When you start including external dependencies (e.g. external shared pipeline segments), it becomes extremely confusing to debug, and you now have to deal with dependency management issues, which are already a plague everywhere else in data analysis (e.g. software versioning, library dependencies, and the management of them). I would much rather copy/paste the same steps across many pipelines than have yet another level of dependency management to worry about
Stephen Kelly
@stevekm
Jan 24 16:42
just my opinion of course lol
Paolo Di Tommaso
@pditommaso
Jan 24 16:43
modules are under active dev nextflow-io/nextflow#984
micans
@micans
Jan 24 16:44
I think there is a sweet spot where modules become profitable. Everyone judges that sweet spot differently, I am probably quite close to @stevekm ...
Paolo Di Tommaso
@pditommaso
Jan 24 16:45
@huguesfontenelle I tend to agree with Stephen; in any case we'd need a detailed error report and the .nextflow.log file to say more
micans
@micans
Jan 24 16:45
(regardless, modules are cool)
Paolo Di Tommaso
@pditommaso
Jan 24 16:46
I agree as well on the modules topic, but there's also the opportunity to simplify the work for big pipelines
Tobias Neumann
@t-neumann
Jan 24 16:49

Is this https://gist.github.com/pditommaso/ff13c333f461ca0d9b839d9e3416b376 still a valid pattern?

I was trying to implement it like this:

    script:
    def configArg = mantaConfigFile.name != 'NO CONFIG' ? "--config $(params.mantaConfig)" : ''

but then I get:

ERROR ~ illegal string body character after dollar sign;
   solution: either escape a literal dollar sign "\$5" or bracket the value expression "${5}" @ line 212, column 57.
   onfigFile.name != 'NO CONFIG' ? "--confi
                                 ^

1 error
rfenouil
@rfenouil
Jan 24 16:52
@stevekm Thank you for your feedback. I tend to agree with your analysis for my everyday use (I need to get some experience with modules though).
I guess a central cache database could be beneficial in very specific cases only. I was thinking about a data repository where many users work on subsets of a large dataset. They all develop custom pipelines. Most of them share quite similar first steps (pre-processing), then diverge (downstream analyses). It's an ideal case for modules and a shared cache, but probably not a very common use-case ;)
Paolo Di Tommaso
@pditommaso
Jan 24 16:53
this is wrong: "--config $(params.mantaConfig)" should be "--config ${params.mantaConfig}"
or just "--config $params.mantaConfig"
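$( ) is shell command substitution, Groovy string interpolation needs ${ }. So the gist's pattern would look roughly like this (sketch, reusing your variable names; remaining args elided):

    script:
    def configArg = mantaConfigFile.name != 'NO CONFIG' ? "--config ${params.mantaConfig}" : ''
    """
    configManta.py ${configArg} ...
    """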
Stephen Kelly
@stevekm
Jan 24 16:54
yea I am a little jaded on the modules topic cause I have been burned too many times by pipelines I had to dev on and debug that used them heavily, lots of headaches, but it's also easier when they are modules you made yourself vs. inheriting someone else's system
Paolo Di Tommaso
@pditommaso
Jan 24 16:54
@rfenouil do you mean for computation purposes or other needs, e.g. provenance, metadata, etc.?
rfenouil
@rfenouil
Jan 24 16:59
I was thinking about computation, but just dreaming really.
My use case is a single pipeline that I run on different subsets of ~100 files. For my own organization, I do that in a different folder each time.
I would have liked the caching (resume) mechanism to search for hashes in all folders so it does not recompute the first step for file A if it has been done before (when processing a different subset).
Paolo Di Tommaso
@pditommaso
Jan 24 17:01
well, that's an idea I've played with as well, it should be possible
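today the closest thing I can suggest is pointing all project folders at one shared work directory (sketch; path made up):

    // in a nextflow.config shared across project folders
    workDir = '/shared/institute/nf-work'

but the resume cache DB still lives under each launch folder's .nextflow directory, so the task hashes themselves aren't shared across projects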
out of curiosity what's your org?
rfenouil
@rfenouil
Jan 24 17:01
Org? Institute?
Paolo Di Tommaso
@pditommaso
Jan 24 17:01
yes organisation, if you can tell
micans
@micans
Jan 24 17:01
organism
rfenouil
@rfenouil
Jan 24 17:02
Ok ;) Mediterranean Institute of Oceanography
Academic research
Paolo Di Tommaso
@pditommaso
Jan 24 17:02
ah nice
rfenouil
@rfenouil
Jan 24 17:02
Got introduced to Nextflow by looking for an RNA-seq pipeline, found Phil Ewels' one... then started to play with NF
Paolo Di Tommaso
@pditommaso
Jan 24 17:03
ah nice
we owe phil a lot :)
rfenouil
@rfenouil
Jan 24 17:03
;)
BTW is there any NF workshop planned soon?
Checked on the CRG website but couldn't find it
Paolo Di Tommaso
@pditommaso
Jan 24 17:04
umm CRG likely in September
rfenouil
@rfenouil
Jan 24 17:05
Ok sounds good, I'll keep an eye on it.
Paolo Di Tommaso
@pditommaso
Jan 24 17:05
but also in Tuebingen at the end of April, ping @apeltzer
rfenouil
@rfenouil
Jan 24 17:05
Ohhh interesting
will do
thanks a lot
Need to go, thank you and sorry for the silly ideas.
If you find an easy solution to share cache between runs, let me know :)
Paolo Di Tommaso
@pditommaso
Jan 24 17:06
bye
rfenouil
@rfenouil
Jan 24 17:07
Awesome
Alexander Peltzer
@apeltzer
Jan 24 17:13
Yes, I'll most likely push out a notification on Twitter again soon... was probably too early to advertise it :-)
Paolo Di Tommaso
@pditommaso
Jan 24 17:13
I guess so
Maxime Garcia
@MaxUlysse
Jan 24 17:22
Otherwise @rfenouil you know that you can come work with us for a week whenever you want ;-)
Tobias Neumann
@t-neumann
Jan 24 18:23
@pditommaso damn - I guess I should start checking out some class. thanks
Stephen Kelly
@stevekm
Jan 24 19:01
hey quick question about NXF_DEBUG, do the values ('1', '2', '3') stack? Like if I set it to 3, will it also cause the level 1 and 2 logging as well?
Shawn Rynearson
@srynobio
Jan 24 19:33
I'm wondering how Nextflow handles pulling Docker containers from private Docker Hub repos?
I haven't found anything specific on the documentation site.
It's not specifically mentioned, but could I create an scm file?
Jason Yamada-Hanff
@yamad
Jan 24 22:45
is there an easy way to rename staged files? that is, I have been using this style: set val(name), file('outfile.csv') into process_out_ch for output. Now I want to collect all of those files and operate on them together. For some tools, it's most natural to have the input files named after the sample, so something with the following semantics: set val(name), file("${name}.csv") from process_out_ch.
Rad Suchecki
@rsuchecki
Jan 24 23:01
probably easiest is to set val(name), file("${name}.csv") into process_out_ch in the first process
Rad Suchecki
@rsuchecki
Jan 24 23:22
or rather file("${name}.csv") into process_out_ch if you want to use process_out_ch.collect() downstream
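i.e. something like this (sketch, process and tool names made up):

    process per_sample {
        input:
        set val(name), file(infile) from samples_ch

        output:
        // the file is named after the sample here, at output time
        file("${name}.csv") into process_out_ch

        """
        my_tool ${infile} > ${name}.csv
        """
    }

    process combined {
        input:
        // all of the sample-named files staged together
        file csvs from process_out_ch.collect()

        """
        merge_csvs ${csvs}
        """
    }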
Jason Yamada-Hanff
@yamad
Jan 24 23:28
yeah, thanks. I will probably go with that for now. Was hoping there was a local way to do it.
Cedric
@Puumanamana
Jan 24 23:50
Hey! Is there an easy way to run a custom script after a given process has generated all of its outputs (i.e. once it has finished processing all of its inputs)? It's basically counting stuff to save a summary at the end. I know I could maybe do a separate process that collects all the files, but I would want to do it for each process of the pipeline and it would be a bit heavy.
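Something like this for every process, basically (sketch of the collect-based version I'd like to avoid repeating; names made up):

    process summarize_step1 {
        input:
        // runs exactly once, after step1 has emitted all of its outputs
        file results from step1_out_ch.collect()

        output:
        file 'step1_summary.txt'

        """
        wc -l ${results} > step1_summary.txt
        """
    }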