These are chat archives for nextflow-io/nextflow

16th
Aug 2017
Simone Baffelli
@baffelli
Aug 16 2017 09:06
Good morning. Does nextflow support a way of having a custom log?
Paolo Di Tommaso
@pditommaso
Aug 16 2017 09:08
you can use log.debug, log.info, etc but not sure it's what you are looking for
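For illustration, a minimal sketch of the built-in logger calls mentioned above, assuming default logging behavior (log.info goes to the console and to .nextflow.log, log.debug only to the log file; params.in is just an illustrative parameter):

    // minimal logging sketch usable anywhere in a pipeline script
    log.info  "pipeline started, input: ${params.in ?: 'n/a'}"
    log.debug "full params map: ${params}"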
Simone Baffelli
@baffelli
Aug 16 2017 09:09
A "full" control of logging
In which I only show the information I am interested in
In some sense a second log file containing only my own messages, the ones I produce with log.info etc
Paolo Di Tommaso
@pditommaso
Aug 16 2017 09:11
there are some proposals such as #211 and #330, but they are not implemented yet
Simone Baffelli
@baffelli
Aug 16 2017 09:12
:+1:
Simone Baffelli
@baffelli
Aug 16 2017 12:24
@pditommaso is it possible atm to access the buffer variable from the opening or closing closure for the buffer operator?
Paolo Di Tommaso
@pditommaso
Aug 16 2017 13:30
do you mean the accumulated buffer ?
Simone Baffelli
@baffelli
Aug 16 2017 13:30
yes
Paolo Di Tommaso
@pditommaso
Aug 16 2017 13:30
no
Simone Baffelli
@baffelli
Aug 16 2017 13:31
Would you be interested in it?
I was considering modifying the operator, if you would accept the PR
Paolo Di Tommaso
@pditommaso
Aug 16 2017 13:32
I have to tell the truth, I've never used it :)
Simone Baffelli
@baffelli
Aug 16 2017 13:33
I am always the one with extreme use cases ;)
Paolo Di Tommaso
@pditommaso
Aug 16 2017 13:33
how would you pass it? as a second closure parameter ?
Simone Baffelli
@baffelli
Aug 16 2017 13:33
Right now I need to check whether the total time span covered by a buffer exceeds a given threshold
at that moment, the buffer should emit it
But I'm having some trouble because some upstream process seems to be changing the order of the data
and the total time depends on the order, so having access to the buffer could help here
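For context, a sketch of how the closing condition currently works (channel contents are illustrative): the closure only sees the incoming item, not the accumulated buffer, which is why a time-span check over the whole buffer is awkward:

    // the closing closure receives only the current item, so a check over the
    // accumulated buffer (e.g. total time span) is not possible with this API
    Channel
        .from( [id: 1, date: 1], [id: 2, date: 5], [id: 3, date: 20], [id: 4, date: 22] )
        .buffer { it.id % 2 == 0 }        // closes the buffer whenever the condition is true
        .subscribe { println "buffer: $it" }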
Paolo Di Tommaso
@pditommaso
Aug 16 2017 13:36
I think it's fine to have it as an optional second parameter to the opening/closing closure
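A purely hypothetical sketch of the API being proposed here, not implemented at the time of this conversation (source_ch and maxSpan are illustrative):

    // hypothetical: the closing closure would optionally take the accumulated
    // buffer as a second argument, so the condition can inspect the whole buffer
    source_ch
        .buffer { item, acc ->
            acc && (item.date - acc.first().date) > maxSpan
        }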
Simone Baffelli
@baffelli
Aug 16 2017 13:37
exactly
that was my idea as well
Paolo Di Tommaso
@pditommaso
Aug 16 2017 13:38
green light
Simone Baffelli
@baffelli
Aug 16 2017 13:38
I will try to work on it when I have some spare time
Paolo Di Tommaso
@pditommaso
Aug 16 2017 13:39
no hurry, I'm on holiday since tomorrow :)
Simone Baffelli
@baffelli
Aug 16 2017 13:41
And I'm going mad at the same old problem :scream:
something is messing up the order of files somewhere
does combine change the order of incoming data?
Paolo Di Tommaso
@pditommaso
Aug 16 2017 13:43
nope
use -resume and -dump-hashes to spot the cause
Simone Baffelli
@baffelli
Aug 16 2017 13:43
I'm using collectFile
to save the IDs as they are received
and I see that although they are ordered upstream, the order changes somewhere
but I can't find where
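A sketch of this kind of collectFile debugging trick, assuming a channel id_ch of (id, file) tuples (names are illustrative):

    // write the ids to a file in arrival order, to inspect the ordering downstream
    id_ch
        .map { id, f -> "${id}\n" }
        .collectFile(name: 'received_order.txt', sort: false)   // sort: false keeps arrival order
        .subscribe { out -> println "order written to: $out" }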
Simone Baffelli
@baffelli
Aug 16 2017 13:51
do processes preserve the output order?
Paolo Di Tommaso
@pditommaso
Aug 16 2017 13:52
processes are executed in parallel, hence the execution order is not deterministic
Simone Baffelli
@baffelli
Aug 16 2017 13:53
I should have thought of it
How stupid
I will insert a toSortedList somewhere
that should take care of it
Paolo Di Tommaso
@pditommaso
Aug 16 2017 13:54
if you use collect, it is already order safe
Simone Baffelli
@baffelli
Aug 16 2017 13:54
I need to order before buffering
because the length of the buffer depends on the sequence of dates
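A sketch of the workaround being discussed, assuming the items carry a date field (items_ch and cutoffDate are illustrative):

    // collect everything, sort by date, then re-emit and buffer; waiting for the
    // whole upstream channel to complete is the bottleneck discussed below
    items_ch
        .toSortedList { it.date }        // sort once all items have arrived
        .flatten()                       // re-emit the sorted items one by one
        .buffer { it.date >= cutoffDate }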
Paolo Di Tommaso
@pditommaso
Aug 16 2017 13:55
likely ..
Simone Baffelli
@baffelli
Aug 16 2017 13:55
it is a bit annoying because I need to wait for everything to be done before moving on
Paolo Di Tommaso
@pditommaso
Aug 16 2017 13:55
exactly
Simone Baffelli
@baffelli
Aug 16 2017 13:56
is it less efficient in terms of parallelization?
or +/- the same?
Paolo Di Tommaso
@pditommaso
Aug 16 2017 14:01
well, that's a bottleneck, because the step after it cannot advance until toSortedList has collected all the items
Simone Baffelli
@baffelli
Aug 16 2017 14:02
that is what I was fearing as well :fearful:
OTOH, all the other upstream processes will still be executed
only the final summarizing step will be delayed
Paolo Di Tommaso
@pditommaso
Aug 16 2017 14:02
yes
Simone Baffelli
@baffelli
Aug 16 2017 14:03
but that's annoying as well
Paolo Di Tommaso
@pditommaso
Aug 16 2017 14:04
you cannot sort an incomplete set ..
Simone Baffelli
@baffelli
Aug 16 2017 14:04
I know!
I will have to be patient
that's it
nik-sm
@nik-sm
Aug 16 2017 18:06
hi - I'm using nextflow to build an artifact of reference data files that I'll use in a downstream nextflow pipeline. As I work on it, I will make small changes to this build_reference.nf script, and I'd like to re-run only the affected portion of the dependency graph. This brings me to two design questions that I'm sure others have dealt with.
1) When I'm making small changes to the nextflow source like this, how can I re-run only the affected part of the dependency graph? (I know that I could simulate this behavior manually, by splitting the file into many smaller parts, iterating development on each part, and combining them at the end, but I'm hoping there's a more clever way to do this.)
2) For a downstream pipeline, sample_pipeline.nf, I'd like to add a fingerprint method for each input file, and then use cached results for some portions but re-run any processes that would be affected. It's as if I would run with an option --smart-resume.
Note that a solution to #2 could solve both of my use cases: if I can input the "current" set of fingerprints of some files and the "previous" fingerprints, and then decide which processes need to run, then when I run sample_pipeline.nf I'll be providing the fingerprints of reference data files, and when I run build_reference.nf I'll provide the fingerprints of groovy snippets (nextflow processes). Please let me know if I'm not explaining clearly. Thanks!!
Paolo Di Tommaso
@pditommaso
Aug 16 2017 18:22
not sure I understand, NF already computes a fingerprint for each process from the given input files and command script
as soon as an input or the script changes, that process is re-executed, and so are all downstream processes that depend on it
if you want finer control, you can create a separate (BASH/whatever) script for each task and test them separately
nik-sm
@nik-sm
Aug 16 2017 18:28
it may be that there's existing functionality I'm missing, but here's what I mean. In build_reference.nf let's say I have this dependency structure: A -> B -> C, D -> E. I run v1 of the script. Then I change process D to make v2 of the script, and re-run. It will still re-run A -> B -> C as well. (I'm not surprised, because I have not set any flags or done anything to tell NF that my current execution of v2 relates to that previous execution of v1, and if it fingerprints the whole file, then it appears I am running an unrelated nextflow file.)
does that make sense?
Paolo Di Tommaso
@pditommaso
Aug 16 2017 18:29
are you using -resume option ?
nik-sm
@nik-sm
Aug 16 2017 18:29
yes
(--resume = -resume?)
Paolo Di Tommaso
@pditommaso
Aug 16 2017 18:30
no
nik-sm
@nik-sm
Aug 16 2017 18:30
whoops
(screenshot attached: image.png)
Paolo Di Tommaso
@pditommaso
Aug 16 2017 18:30
double-dash options are interpreted as user parameters and NF ignores them :)
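A small sketch of the distinction (the genome parameter and its value are illustrative):

    // invoked for example as:  nextflow run main.nf -resume --genome GRCh38
    // '-resume' is consumed by Nextflow itself, while '--genome' becomes params.genome
    params.genome = 'GRCh37'            // default, overridden by --genome on the command line
    log.info "using genome: ${params.genome}"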
nik-sm
@nik-sm
Aug 16 2017 18:31
ok - thanks for helping me debug!
Paolo Di Tommaso
@pditommaso
Aug 16 2017 18:31
though it's true that it's a bit confusing, we need to improve that somehow
nik-sm
@nik-sm
Aug 16 2017 18:32
so actually 2 things to clarify: is this "execution history" based on the name of the NF file? and how about input files - is it based on a checksum of the file?
Paolo Di Tommaso
@pditommaso
Aug 16 2017 18:33
so actually 2 things to clarify: is this "execution history" based on the name of the NF file?
what do you mean exactly?
nik-sm
@nik-sm
Aug 16 2017 18:36
what attributes can change while still successfully doing a -resume and avoiding some parts of the execution; e.g. changing the *.nf file name, changing the name of processes inside, changing code inside some process. And for sensitivity to the input files, same question: how should it behave if I keep the same file contents under a different file name, or the same name but different contents?
Paolo Di Tommaso
@pditommaso
Aug 16 2017 18:38
for each process a unique key is created by hashing the process name, the inputs and the script command
the main script name is irrelevant
nik-sm
@nik-sm
Aug 16 2017 18:39
ok excellent
since you've been so helpful, one last question if you don't mind? If I'm storing a reference file on S3, I thought they do not provide an API to get the checksum of a file remotely. Am I wrong about S3 remote checksums, or otherwise how will NF decide whether or not some process needs to re-run?
Paolo Di Tommaso
@pditommaso
Aug 16 2017 18:44
it hashes the file metadata (full path, size, last modified)
optionally you can have it create a checksum of the overall content
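A sketch of opting into content-based hashing per process via the cache directive ('deep' hashes the actual file content rather than the metadata; the process name, channel and command here are illustrative):

    process index_reference {
        cache 'deep'                    // hash input file content instead of metadata

        input:
        file ref from reference_ch

        """
        samtools faidx $ref
        """
    }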
nik-sm
@nik-sm
Aug 16 2017 18:46
ok great. thanks again, very helpful!
Paolo Di Tommaso
@pditommaso
Aug 16 2017 18:46
you are welcome
klitgord
@klitgord
Aug 16 2017 19:43

Hello,
I am having a little issue getting autoscaling to work correctly in the nextflow cloud environment.

The master of my cluster comes online without issue, but workers never do, even when I submit a workflow.
My config for launch looks as follows:>>>
cloud {
    imageId = 'ami-fab28181'
    instanceType = 'm4.large'
    keyName = 'Bioinfo_US_east_1'
    securityGroup = 'sg-1bc4dc66'
    userName = 'ubuntu'

    autoscale {
        enabled = true
        spotPrice = 1.33
        minInstances = 1
        maxInstances = 2
        imageId = 'ami-fab28181'
        instanceType = 'r3.4xlarge'
        terminateWhenIdle = true
        instanceStorageMount = '/home/ubuntu/EDGE_output'
        instanceStorageDevice = '/dev/xvdc'
    }
}

process {
    // Set here according to resources available on r3.4xlarge
    // Memory should be a little less than Amazon lists for node type
    cpus = 16
    memory = 120.GB
}
<<<

The nf file I am trying to run:>>>

#!/usr/bin/env nextflow

params.in = "test.list.txt"
params.type = "strain"

sample_file = file(params.in)
sampletype = params.type

samples = sample_file.readLines()
sample_ch = Channel.from(samples)

process run_sample {

    cpus 8
    memory '120 GB'

    input:
    val sample from sample_ch
    val t from sampletype

    output:
    stdout result

    """
    python ~/edge_tool_scripts/get_run_viomega_reprocess.py -s $sample -c $t
    """
}

result.subscribe { println it }
<<<

Note: works locally without the cpus/memory config

Sorry, I hit enter before I finished..

With the CPUs/memory config on the cluster, no workers launch and I get the following:
WARN: ### Task (id=1) requests an amount of resources not available in any node in the current cluster topology -- CPUs: 8; memory: 120 GB

Without the CPUs/memory config, it tries to run everything locally (on the master, that is), and no workers launch.
I am sure I am buggering it up somehow, just not quite sure how...
Any suggestions?
kind regards,
Niels

Jean-Christophe Houde
@jchoude
Aug 16 2017 20:06
hi all, a quick question: in my output section, I have something like:
file "${bname}/${sid}__*endpoints_metric.nii.gz"`optional true
Since my process might not create those files if, for example, the source files are empty.
However, this doesn't seem to work as I expected. When no file is created, the process fails. In fact, it seems to fail on the final rsync call in .command.run, which should bring back the results from the /tmp directory to the work dir.
It works when the file path has no wildcard inside, which is the expected behavior.
Is there a way to overcome this behavior in the case of a wildcard?
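For reference, a minimal sketch of the kind of declaration being described, assuming an illustrative input channel and a hypothetical compute_endpoint_metrics.sh script:

    process compute_metrics {
        input:
        set val(sid), val(bname), file(bundle) from bundles_ch

        output:
        // a glob output marked optional: the process is expected not to fail
        // when no matching file is produced
        file "${bname}/${sid}__*endpoints_metric.nii.gz" optional true

        """
        compute_endpoint_metrics.sh $bundle $sid $bname
        """
    }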
Paolo Di Tommaso
@pditommaso
Aug 16 2017 21:36
not sure if it's a bug
if you are able to replicate with a test case, please open an issue
@klitgord I'm not sure I understand
klitgord
@klitgord
Aug 16 2017 23:37

Sorry Paolo, and thank you for reading; I am pretty new to nextflow. I will try to rephrase/clarify. For context, I am trying to get your nextflow tool to loop through an arbitrary number of samples on AWS in the nextflow cloud cluster framework, using the autoscale option.

I suppose there were really two parts to my question.
The first was that when using auto-scaling, no autoscale/worker nodes would come up, regardless of the number of processes queued from a workflow or the minimum number of instances I specified. I wasn't sure if maybe there is some other requirement for getting nextflow processes into a queue so that autoscaling of additional resources takes effect. I did notice that the master would try to run my processes when I did not specify a minimum CPU or memory. Similarly, I would only get an error that my current cluster topology was inadequate when I did specify a minimum CPU and memory, and a new 'worker' node was not brought online. I don't see any other errors, so perhaps I need to configure something differently? Possibly on AWS's side? Do any specific ports need to be open other than SSH?

My second question, which could be related, is about how to properly specify the available resources of a node. When using the AWS cloud option in nextflow, I presume we need to put the CPU and memory requirements into a process, and I see that as an option in your documents. What was less clear was how to specify how much CPU/memory each compute resource has; I found in the NGI-RNAseq workflow docs (https://github.com/ewels/NGI-RNAseq/blob/master/docs/amazon_web_services.md) a suggestion that I should put them in my config, is that correct? Is there perhaps a better/preferred way to do so? In a similar vein, is there a way to view what resources nextflow currently knows it has?

regards and much thanks,
Niels