These are chat archives for nextflow-io/nextflow

5th
Sep 2017
Simone Baffelli
@baffelli
Sep 05 2017 07:25
Good morning. Lesson learned the hard way: do not run more than one nextflow pipeline in the same working directory :clap:
Paolo Di Tommaso
@pditommaso
Sep 05 2017 07:31
at the same time? what happened?
Simone Baffelli
@baffelli
Sep 05 2017 07:33
no, at different times, but that messes up the log and history
at least in my case, I had to manually remove cache files and touch the history file
I had several cases of Missing cache index file:
the only solution was to remove the corresponding session_id from the history and delete the history folder manually
Paolo Di Tommaso
@pditommaso
Sep 05 2017 07:35
weird, it should work
Simone Baffelli
@baffelli
Sep 05 2017 07:35
I don't know, perhaps I made a mess
I noticed that it usually happens with entries that have - as STATUS in the log
Simone Baffelli
@baffelli
Sep 05 2017 07:41
It cannot deal with missing index files (understandably)
the quick and dirty solution is to use sed and delete these entries from the history
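A minimal sketch of that cleanup in Python rather than sed. It assumes the history file is tab-separated with the run status in the fourth column, which should be checked against the actual history file before use; `drop_incomplete_runs` and the sample rows are hypothetical.

```python
def drop_incomplete_runs(lines, status_col=3, sep="\t"):
    """Return the history lines whose status field is not '-'.

    Assumption: fields are separated by `sep` and the status sits at
    index `status_col`; adjust both to match the real history file.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        # Lines too short to have a status field are kept untouched.
        if len(fields) > status_col and fields[status_col] == "-":
            continue
        kept.append(line)
    return kept


# Made-up rows for illustration only.
sample = [
    "2017-09-05 07:00:00\t1m\trun_a\tOK\trev1\tsession-1\tnextflow run main.nf\n",
    "2017-09-05 07:10:00\t-\trun_b\t-\trev1\tsession-2\tnextflow run main.nf\n",
]
cleaned = drop_incomplete_runs(sample)
```

Working on a backup copy of the history file first is the safer choice, since the filter cannot be undone.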
Paolo Di Tommaso
@pditommaso
Sep 05 2017 07:42
umm, is there an error message in the log file of the previous executions?
Simone Baffelli
@baffelli
Sep 05 2017 07:45
too late, cleaning up now
but what do you mean exactly? It could be that I will encounter the issue again
I confirm, some index files are missing for an unknown reason. They are always associated with runs where status is - in the log
perhaps these runs were interrupted before the index file could be written?
Paolo Di Tommaso
@pditommaso
Sep 05 2017 07:49
hard to say, are you able to replicate the problem with a small example?
Simone Baffelli
@baffelli
Sep 05 2017 07:52
I'm not sure, I don't know what I did exactly
My suspicion is that I interrupted a run using CTRL+C before the index was written
when does nextflow write the index?
Paolo Di Tommaso
@pditommaso
Sep 05 2017 07:56
CTRL+C should be safe ..
Simone Baffelli
@baffelli
Sep 05 2017 12:02
Now that is a weird error: Sep-05 14:00:30.644 [Actor Thread 104] DEBUG nextflow.util.CacheHelper - Unable to hash file: home -- Cause: java.nio.file.NoSuchFileException: home
Paolo Di Tommaso
@pditommaso
Sep 05 2017 12:07
Open an issue with the complete log file, please
Simone Baffelli
@baffelli
Sep 05 2017 12:09
is that a known issue?
That is interesting as well: Unknown hashing type: class java.util.LinkedHashMap$Entry
Simone Baffelli
@baffelli
Sep 05 2017 12:15
They seem to be related in that they are observed for the same thread
Paolo Di Tommaso
@pditommaso
Sep 05 2017 12:17
You are messing up something with map values, likely a variable in the script context
Simone Baffelli
@baffelli
Sep 05 2017 12:20
I'll try replacing the map with a list of tuples
and see if that plays better
Piotr Kaleta
@pkaleta
Sep 05 2017 12:28
Does nextflow provide any kind of data provenance? I.e. assuming I have a certain output, is it possible to trace the versions of the code for all the tasks in my pipeline that were used to create that output?
The same question about the inputs used to produce a particular output file.
Simone Baffelli
@baffelli
Sep 05 2017 12:50
Still having the same problem. Very puzzling
Paolo Di Tommaso
@pditommaso
Sep 05 2017 12:59
do you mean the version of the tools? NF is not aware of the tools you are running in your task, it can be one, many, or just a bash script
however if you are using containers, you can easily track the container version
Piotr Kaleta
@pkaleta
Sep 05 2017 13:00
I don't mean tools
Assume I'm iterating on what the command script looks like.
Some of my command scripts are plain scripts written in the .nf file.
Others are python scripts called from within .nf file.
I want to be able to tell which versions of NF command scripts and python scripts were used to produce a particular output.
Paolo Di Tommaso
@pditommaso
Sep 05 2017 13:03
the nextflow log command allows you to dump the script and a lot of information for each executed task
however it would help little with external scripts
Simone Baffelli
@baffelli
Sep 05 2017 13:04
@pditommaso I think that request is very similar to #413
Paolo Di Tommaso
@pditommaso
Sep 05 2017 13:04
my suggestion is to track all your scripts (including external scripts) with a Git repository
that would allow you to handle versions in a consistent manner
#413 is more related to the cache invalidation
Piotr Kaleta
@pkaleta
Sep 05 2017 13:06
Right, with Git however, I'd still need to track which version of my Python script was used for each pipeline version externally right?
Paolo Di Tommaso
@pditommaso
Sep 05 2017 13:07
well, you can consider the whole pipeline as a unique project, you don't need to keep a version number for each script
Piotr Kaleta
@pkaleta
Sep 05 2017 13:10
I'm already doing that. The question is how do I go from a particular output file to the Git revision of my pipeline that was used to produce it.
Simone Baffelli
@baffelli
Sep 05 2017 13:10
I don't think that is directly possible at the moment
Paolo Di Tommaso
@pditommaso
Sep 05 2017 13:13
when you launch a NF pipeline it prints the git commit ID for that repo.
You can use that ID to access the specific version of any script
Simone Baffelli
@baffelli
Sep 05 2017 13:13
something nasty is happening, related to collectFile
but I don't understand what and why
Paolo Di Tommaso
@pditommaso
Sep 05 2017 13:14
for example if the commit is 3b5b49f then you can browse the repo https://github.com/nextflow-io/rnaseq-nf/blob/3b5b49f/main.nf
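A small sketch of that lookup in Python: it builds the browse URL from a repository name, a commit ID and a file path, following the URL pattern in the example above. `blob_url` is a hypothetical helper, and the `host` default assumes a public GitHub repository.

```python
def blob_url(repo, commit, path, host="https://github.com"):
    """Build a URL for browsing `path` in `repo` at a given commit."""
    return f"{host}/{repo}/blob/{commit}/{path.lstrip('/')}"


url = blob_url("nextflow-io/rnaseq-nf", "3b5b49f", "main.nf")
```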
Piotr Kaleta
@pkaleta
Sep 05 2017 13:22

when you launch a NF pipeline it prints the git commit ID for that repo.

I only get this:

N E X T F L O W  ~  version 0.24.3
Launching `versioned_pipeline.nf` [elegant_kowalevski] - revision: 14aadc973f
[warm up] executor > local
but 14aadc973f is not the commit hash in my repo
Paolo Di Tommaso
@pditommaso
Sep 05 2017 13:23
I guess you are providing the local script, not the repo URL, on the run command line
Piotr Kaleta
@pkaleta
Sep 05 2017 13:24
yes
Paolo Di Tommaso
@pditommaso
Sep 05 2017 13:24
that's the problem
Piotr Kaleta
@pkaleta
Sep 05 2017 13:25
k, i'll try that
are there docs about how to do that with a private repository?
and private GH enterprise?
Piotr Kaleta
@pkaleta
Sep 05 2017 13:28
Thanks!
Paolo Di Tommaso
@pditommaso
Sep 05 2017 13:28
:+1:
Piotr Kaleta
@pkaleta
Sep 05 2017 13:30
I guess one consequence of the above design is the fact that your pipeline DAG (*.nf file) has to live in the same repo as your external scripts.
Is there a way to split a single huge *.nf file to separate pieces, e.g. per each process?
Paolo Di Tommaso
@pditommaso
Sep 05 2017 13:31
not at this time
Piotr Kaleta
@pkaleta
Sep 05 2017 13:32
ok
Paolo Di Tommaso
@pditommaso
Sep 05 2017 13:32
tho, you can create a separate BASH, Python, etc. script for each task
Piotr Kaleta
@pkaleta
Sep 05 2017 13:32
that's true
Simone Baffelli
@baffelli
Sep 05 2017 14:25
What does this error message ERROR n.extension.DataflowExtensions - @unknown imply? Something is very weird with collectFile
Paolo Di Tommaso
@pditommaso
Sep 05 2017 14:26
look at the error stack trace in the log file
Simone Baffelli
@baffelli
Sep 05 2017 14:27
It is huge!!! :scream: https://pastebin.com/WTSxW4xN
everything seems to be caused by java.nio.file.NoSuchFileException: home
maybe I know why
Paolo Di Tommaso
@pditommaso
Sep 05 2017 14:29
I don't think it causes the problem, there should be some other error before
Simone Baffelli
@baffelli
Sep 05 2017 14:30
Yes, found it out. I was trying to apply collectFile on a channel preceded with a map{it->it[0]} but the channel was not passing tuples, only single files.
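The same kind of mistake, sketched in Python rather than in Nextflow's Groovy as an analogy only: taking `[0]` works on a tuple, but a bare path is indexable too, so the wrong shape does not fail at the point of the mistake. `first_element` is a hypothetical guard, not a Nextflow function.

```python
def first_element(item):
    """Return item[0], but only for real tuples or lists."""
    if not isinstance(item, (tuple, list)):
        raise TypeError(f"expected a tuple, got {type(item).__name__}: {item!r}")
    return item[0]


# Channel-like data that really carries tuples.
tuples = [("sample1", "a.txt"), ("sample2", "b.txt")]
keys = [first_element(t) for t in tuples]

# A bare path string is indexable too, so a plain [0] does not raise here.
unchecked = "home/user/a.txt"[0]
```

Failing early with a clear message is cheaper than tracing a downstream file error back to the shape of the channel items.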
Paolo Di Tommaso
@pditommaso
Sep 05 2017 14:31
:+1:
Simone Baffelli
@baffelli
Sep 05 2017 14:32
1/2 day wasted because of a typo
Paolo Di Tommaso
@pditommaso
Sep 05 2017 14:32
shit happens . . .
Steve Marshall
@stevemmarshall
Sep 05 2017 18:56
Has anyone had this issue when running the command: nextflow cloud spot-prices
WARN: Unknown instance type: m4.large
WARN: Unknown instance type: i3.2xlarge ...
i set my AWS environment variables properly
i'm following the instructions from here https://www.nextflow.io/docs/latest/awscloud.html
Piotr Kaleta
@pkaleta
Sep 05 2017 19:01

Just to make sure what I'm seeing is expected behavior: nextflow doesn't take the docker image into account when calculating a hash to cache the outputs, am I right?

In other words, if just the docker image changes, nextflow will not rerun the process. Instead it will use the cached version, if it exists.

Paolo Di Tommaso
@pditommaso
Sep 05 2017 19:05
@stevemmarshall something seems broken, I guess Amazon changed the prices file format, can you please open an issue for that https://github.com/nextflow-io/nextflow/issues
@pkaleta you are right
Piotr Kaleta
@pkaleta
Sep 05 2017 19:10
What's the motivation behind this? Would it be too computationally expensive to compute the hash including docker image?
It would be nice if it happened with cache 'deep'
Paolo Di Tommaso
@pditommaso
Sep 05 2017 19:12
because task output should depend only on the inputs + script
but I see your point, maybe it could make sense to also include the container name, you may want to open an issue for that
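To illustrate the idea only (this is not how Nextflow computes its task hash), here is a Python sketch of a cache key built from the script and inputs, with the container name as an optional extra component. `task_key` and the image names are made up for the example.

```python
import hashlib


def task_key(script, inputs, container=None):
    """Hash the script and inputs, plus the container name if one is given."""
    h = hashlib.sha256()
    h.update(script.encode("utf-8"))
    for value in inputs:
        # A separator byte keeps adjacent fields from running together.
        h.update(b"\0")
        h.update(str(value).encode("utf-8"))
    if container is not None:
        h.update(b"\0container:")
        h.update(container.encode("utf-8"))
    return h.hexdigest()


# Same script and inputs, container ignored: the keys match, so a cache hit.
without = (task_key("echo hi", ["a.txt"]), task_key("echo hi", ["a.txt"]))

# Container included: changing only the image tag changes the key.
with_image = (
    task_key("echo hi", ["a.txt"], container="ubuntu:16.04"),
    task_key("echo hi", ["a.txt"], container="ubuntu:17.04"),
)
```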
Félix C. Morency
@fmorency
Sep 05 2017 19:13
+1
Mike Smoot
@mes5k
Sep 05 2017 19:14
Agreed, triggering a process after changing the image name or version sounds good.
Piotr Kaleta
@pkaleta
Sep 05 2017 19:15
I'll open a GH issue for that
Steve Marshall
@stevemmarshall
Sep 05 2017 20:19
@pditommaso just opened an issue... thanks
Paolo Di Tommaso
@pditommaso
Sep 05 2017 20:20
welcome
Steve Marshall
@stevemmarshall
Sep 05 2017 20:20
I also tried without the spot instances and it gave me the same issue