These are chat archives for nextflow-io/nextflow

10th Nov 2016
Paolo Di Tommaso
@pditommaso
Nov 10 2016 00:08
ok
Johan Viklund
@viklund
Nov 10 2016 12:53
What's the reason for disallowing having a channel as output channel for several processes?
I can understand the input restriction
Johan Viklund
@viklund
Nov 10 2016 13:04
I guess simplicity in the nextflow library or something, but it does cause a lot of complexity in the workflows instead
Phil Ewels
@ewels
Nov 10 2016 13:06
channels are consumed by processes, so I guess it would require a pretty major change in core logic..
Johan Viklund
@viklund
Nov 10 2016 13:06
yes, for input, but output?
now I have to create 2 extra channels and a mix operator
feels unnecessary
I don't know anything about the internals, but from my perspective it doesn't feel like it should be a major change
there might be some locking logic that needs to be done, but that should already be there for the mix case anyway
Johan Viklund
@viklund
Nov 10 2016 13:15
no, there shouldn't be any locking either, each process just pushes one value at a time, and there might be several instances of the same process running with different inputs
I wonder what happens if I comment out this part :)
Paolo Di Tommaso
@pditommaso
Nov 10 2016 14:34
Did I miss the party?! :)
Maxime Garcia
@MaxUlysse
Nov 10 2016 14:37
Hello,
I have an output from a process that contains:
[idPATIENT, idNORMAL, idTUMOR, tag, file]
I have only 1 idPATIENT
I'm supposed to have 1 idNORMAL
I have a couple of idTUMOR
and I have several different tags
I want to collect all these different values and steps into something like:
[idPATIENT, idNORMAL, idTUMOR, [file, file, file...]]
So I tried a groupTuple(by:[0,1,2])
But I only got something like:
[idPATIENT, idNORMAL, idTUMOR, file, file, file...]
Is it possible to get it right?
Paolo Di Tommaso
@pditommaso
Nov 10 2016 14:42
let me take a look at it
Phil Ewels
@ewels
Nov 10 2016 14:45
I have what is hopefully a simple question.. :) Invalid groovy, and I don't know why:
cpus { params.star_cpus ?: 10 }
Paolo Di Tommaso
@pditommaso
Nov 10 2016 14:45
it should work
just tried this
Channel
  .from( ['a','z', file('a')], ['a','z', file('a')], ['b', 'q', file('b')], ['b', 'q', file('b')])
  .groupTuple(by:[0,1])
  .println()
returns
[a, z, [/Users/pditommaso/projects/nxf-httpfs/a, /Users/pditommaso/projects/nxf-httpfs/a]]
[b, q, [/Users/pditommaso/projects/nxf-httpfs/b, /Users/pditommaso/projects/nxf-httpfs/b]]
Maxime Garcia
@MaxUlysse
Nov 10 2016 14:46
try cpus { params.star_cpus ? "" : 10 }
Phil Ewels
@ewels
Nov 10 2016 14:46
But that doesn't do what I want it to do :)
Paolo Di Tommaso
@pditommaso
Nov 10 2016 14:46
why is it not working?
Maxime Garcia
@MaxUlysse
Nov 10 2016 14:47
what do you want to do
Paolo Di Tommaso
@pditommaso
Nov 10 2016 14:47
I mean, is there an error ?
Maxime Garcia
@MaxUlysse
Nov 10 2016 14:47
Thanks @pditommaso, I'm trying
Phil Ewels
@ewels
Nov 10 2016 14:47
haha, fail. Sorry - false alarm, @Galithil was testing the wrong code source
Paolo Di Tommaso
@pditommaso
Nov 10 2016 14:49
now the big question
@viklund nextflow is based on the dataflow programming model and uses the GPars implementation
under the hood a channel is a dataflow queue and by definition when you read a value from this queue, you consume that value
thus if you have two or more processes reading from the same channel it will result in non-deterministic behaviour
because all of them will try to consume the same resource
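A minimal standalone GPars sketch of that consumption behaviour (values are illustrative):
import groovyx.gpars.dataflow.DataflowQueue

def queue = new DataflowQueue()
queue << 'a'
queue << 'b'

// each read removes the value from the queue, so two readers would
// race for the same elements instead of each seeing both of them
assert queue.val == 'a'
assert queue.val == 'b'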
Maxime Garcia
@MaxUlysse
Nov 10 2016 14:55
@pditommaso It was indeed working, but in fact, I'm doing a spread afterward to add one more value to the Channel, and it flattens it
Paolo Di Tommaso
@pditommaso
Nov 10 2016 14:56
um, so how can I help you? :)
Tiffany Delhomme
@tdelhomme
Nov 10 2016 14:56

Does anyone have an idea of how to combine two channels as input? i.e. parallelize across both:

    input:
    each file tn from tn_bambai
    each file bed from split_bed

I would like to run my process on every combination of the tn_bambai and split_bed channels

Paolo Di Tommaso
@pditommaso
Nov 10 2016 14:58
use a cross operator instead of the each inputs
Tiffany Delhomme
@tdelhomme
Nov 10 2016 14:59
yes but cross needs a matching key
Paolo Di Tommaso
@pditommaso
Nov 10 2016 14:59
oops
Tiffany Delhomme
@tdelhomme
Nov 10 2016 15:00
:smile:
each works well but not if I have files...
Paolo Di Tommaso
@pditommaso
Nov 10 2016 15:02
if so use spread
it will give you a warning but it's not a problem
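A sketch of the spread approach for this case (the channel contents are illustrative):
tn_bambai = Channel.from(file('a.bam'), file('b.bam'))
split_bed = Channel.from(file('chr1.bed'), file('chr2.bed'))

// spread pairs every item of one channel with every item of the other:
// [a.bam, chr1.bed], [a.bam, chr2.bed], [b.bam, chr1.bed], [b.bam, chr2.bed]
tn_bambai
  .spread(split_bed)
  .println()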
Maxime Garcia
@MaxUlysse
Nov 10 2016 15:03
Before: [idPATIENT, idNORMAL, idTUMOR, [file, file, file, file]]
After = Channel.from('STRING').spread(Before)
After: [STRING, idPATIENT, idNORMAL, idTUMOR, file, file, file, file]

I was expecting: [STRING, idPATIENT, idNORMAL, idTUMOR, [file, file, file, file]]

I can put my STRING before doing the .groupTuple(by:[0,1,2]), so no problem for me, but I don't think it is normal that a spread flattens the Channel like that
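A sketch of that workaround, with illustrative values: prepending the string before grouping keeps the file list nested:
Channel
  .from(['p1', 'n1', 't1', file('f1')],
        ['p1', 'n1', 't1', file('f2')])
  .map { ['STRING'] + it }          // add the string first...
  .groupTuple(by: [0, 1, 2, 3])     // ...then group; files stay nested
  .println()
// emits: [STRING, p1, n1, t1, [f1, f2]]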

Paolo Di Tommaso
@pditommaso
Nov 10 2016 15:06
good point, you may want to open an issue for that
Johan Viklund
@viklund
Nov 10 2016 15:13
@pditommaso that explains why I can't consume the same channel in two processes, and I accept that
but why can't I output to the same channel from several places?
Paolo Di Tommaso
@pditommaso
Nov 10 2016 15:14
I'm not understanding
an example?
Johan Viklund
@viklund
Nov 10 2016 15:15
process A {
    output: file 'test' into ch
    """
    echo A > test
    """
}
process B {
    output: file 'test' into ch
    """
    echo B > test
    """
}
now I have to use 2 temp channels and then mix them
my real example is conditionally generating bam indexes if they don't exist with the specified bam input file
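The two-temp-channels workaround would look roughly like this:
process A {
    output: file 'test' into ch_a
    """
    echo A > test
    """
}
process B {
    output: file 'test' into ch_b
    """
    echo B > test
    """
}

// merge both outputs into the single channel downstream processes read
ch = ch_a.mix(ch_b)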
Paolo Di Tommaso
@pditommaso
Nov 10 2016 15:17
a bit long to explain, but I will try
Johan Viklund
@viklund
Nov 10 2016 15:17
I thought that the into ch basically just did the << into the channel
Paolo Di Tommaso
@pditommaso
Nov 10 2016 15:18
a process in the dataflow model just waits for data; when a new input is read it triggers the execution of the process
thus in principle it's an endless execution
Johan Viklund
@viklund
Nov 10 2016 15:19
yes, so far I'm with you
Paolo Di Tommaso
@pditommaso
Nov 10 2016 15:19
for this reason each channel emits a stopping signal as the very last element to terminate the execution of the process/operator to which it's attached
Johan Viklund
@viklund
Nov 10 2016 15:20
(maybe my example should have included one input channel each for the processes)
aha, so it's the close operation that is the problem
Paolo Di Tommaso
@pditommaso
Nov 10 2016 15:20
when all processes are done, the pipeline (network) terminates
yes,
in your snippet, both processes will generate a STOP signal, so you may have a channel producing
Johan Viklund
@viklund
Nov 10 2016 15:21
ok, to solve that you would need a counter on the channel that was decremented for each close() op and when it is zero you STOP for real
Paolo Di Tommaso
@pditommaso
Nov 10 2016 15:21
( .. , data, data, STOP, data, STOP )
Johan Viklund
@viklund
Nov 10 2016 15:21
But now I see why it's the way it is.
Paolo Di Tommaso
@pditommaso
Nov 10 2016 15:21
and the downstream process will lose the last data
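A standalone sketch of that failure mode, using an illustrative STOP marker in place of the real control message:
import groovyx.gpars.dataflow.DataflowQueue

def STOP = 'STOP'          // stand-in for the real termination signal
def ch = new DataflowQueue()
ch << 'A'; ch << STOP      // first producer finishes
ch << 'B'; ch << STOP      // second producer finishes

// the consumer terminates at the first STOP, so 'B' is never processed
def v
while ((v = ch.val) != STOP) {
    println v              // prints only: A
}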
Johan Viklund
@viklund
Nov 10 2016 15:22
This should be solvable with some extra code and a counter, right?
(it might be tricky to get it to work, I understand that, but in principle)
is it the underlying library that implements the STOP signals?
Paolo Di Tommaso
@pditommaso
Nov 10 2016 15:23
the solution would be to create the full DAG before launching the execution
but with the current implementation it's not possible
maybe in the version 2.0 ..
Johan Viklund
@viklund
Nov 10 2016 15:23
ahh
Paolo Di Tommaso
@pditommaso
Nov 10 2016 15:23
;)
is it the underlying library that implements the STOP signals?
yes
Johan Viklund
@viklund
Nov 10 2016 15:24
would it be possible to do it with lexical analysis of the source code? Or can channel names be dynamically generated?
oops, I'm late, see you later
Paolo Di Tommaso
@pditommaso
Nov 10 2016 15:35
they are dynamically generated
Félix C. Morency
@fmorency
Nov 10 2016 18:14
@pditommaso remember the .fromFilePairs(...) { it.parent.name } trick you gave me?
how can I do the same with .fromPath()(single file)?
Paolo Di Tommaso
@pditommaso
Nov 10 2016 18:19
umm, frankly I don't remember it
how was it?
Félix C. Morency
@fmorency
Nov 10 2016 18:21
(sid, file1, file2) = Channel
    .fromFilePairs("$input/**/*{stuff1,stuff2}.ext",
                   size: 2,
                   flat: true) { it.parent.name }
    .separate(3)
Paolo Di Tommaso
@pditommaso
Nov 10 2016 18:22
and what do you want to do?
Félix C. Morency
@fmorency
Nov 10 2016 18:22
same thing, but (sid, file1)
aka. single file + sid
.fromFilePairs() doesn't seem to work in that case
Paolo Di Tommaso
@pditommaso
Nov 10 2016 18:23
of course, that is designed for pairs or more files
much easier
Channel.fromPath('/some/path/glob*')
    .map { file -> <your code>; return [sid, file] }
map is a key operation to work with NF
Félix C. Morency
@fmorency
Nov 10 2016 18:25
and <your code> is?
Paolo Di Tommaso
@pditommaso
Nov 10 2016 18:26
well, the code you need to get the sid from a file
Félix C. Morency
@fmorency
Nov 10 2016 18:28
mmm
Paolo Di Tommaso
@pditommaso
Nov 10 2016 18:28
not convinced ? :)
Félix C. Morency
@fmorency
Nov 10 2016 18:30
yeah i just need to wrap my head around the groovy thing
Paolo Di Tommaso
@pditommaso
Nov 10 2016 18:31
I guess you want to extract the sid from the file name, right?
Félix C. Morency
@fmorency
Nov 10 2016 18:31
the path. the sid is it.parent.name
ie. the parent folder
Paolo Di Tommaso
@pditommaso
Nov 10 2016 18:32
eg. /a/b/c/name.txt
what would it be, /a/b/c or c ?
Félix C. Morency
@fmorency
Nov 10 2016 18:33
c
Paolo Di Tommaso
@pditommaso
Nov 10 2016 18:33
thus it.parent.name
Félix C. Morency
@fmorency
Nov 10 2016 18:33
.map{ f -> sid = file(f).parent.name; return [sid, f] }
Paolo Di Tommaso
@pditommaso
Nov 10 2016 18:34
you don't need file(f)
f is already a file
.map{ f -> sid = f.parent.name; return [sid, f] }
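Putting the pieces together (the glob is illustrative):
Channel
  .fromPath("$input/**/*.ext")
  .map { f -> def sid = f.parent.name; [sid, f] }
  .println()
// emits tuples like: [c, /a/b/c/name.ext]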
Félix C. Morency
@fmorency
Nov 10 2016 18:34
yay it works
Paolo Di Tommaso
@pditommaso
Nov 10 2016 18:34
:+1:
Félix C. Morency
@fmorency
Nov 10 2016 18:35
thanks :)
Paolo Di Tommaso
@pditommaso
Nov 10 2016 18:35
have a look at this
that's all you need to know of groovy
Félix C. Morency
@fmorency
Nov 10 2016 18:43
is there a way to change the max cpu of the local scheduler?
Paolo Di Tommaso
@pditommaso
Nov 10 2016 18:55
add the following line in the config file
executor.local.cpus = <n>
Félix C. Morency
@fmorency
Nov 10 2016 18:56
thanks!
Paolo Di Tommaso
@pditommaso
Nov 10 2016 19:06
welcome!
Mike Smoot
@mes5k
Nov 10 2016 19:11
Hi Paolo, does file() handle http URLs in the same transparent way that it handles S3 buckets? Possibly just for reading?
Paolo Di Tommaso
@pditommaso
Nov 10 2016 19:12
em ..
working on that !
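For reference, the transparent S3 handling the question refers to looks like this (bucket and key are illustrative):
// file() resolves an s3:// URI to a Path just like a local path;
// Nextflow stages the object automatically when used as a process input
genome = file('s3://my-bucket/data/genome.fa')
println genome.name   // genome.fa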
Mike Smoot
@mes5k
Nov 10 2016 19:14
Wow, good timing. I'm mostly just curious at this point. I suspect I'll be handed urls as input at some point, so was checking. Thanks!
Paolo Di Tommaso
@pditommaso
Nov 10 2016 19:14
you guys are lucky :)
Mike Smoot
@mes5k
Nov 10 2016 19:15
indeed we are!
Paolo Di Tommaso
@pditommaso
Nov 10 2016 19:16
do you think ftp could be useful as well?
Mike Smoot
@mes5k
Nov 10 2016 19:18
Possibly. I'd guess it might be useful for supporting legacy downloads. I'm not sure I'd spend a lot of time on it, though.
Félix C. Morency
@fmorency
Nov 10 2016 19:24
executor.local.cpus doesn't seem to work. it takes all the avail cores
Paolo Di Tommaso
@pditommaso
Nov 10 2016 19:26
sorry it's
executor.$local.cpus = <n>
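Equivalently, in nextflow.config with the scoped syntax (the value is illustrative):
executor {
    $local {
        cpus = 4
    }
}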
Félix C. Morency
@fmorency
Nov 10 2016 19:26
testing
it works! thanks!
Paolo Di Tommaso
@pditommaso
Nov 10 2016 19:28
:+1:
Jason Byars
@jbyars
Nov 10 2016 21:49
Is the ignitevisor supposed to be on the example AMI?
Paolo Di Tommaso
@pditommaso
Nov 10 2016 21:53
nope, actually I've never used it
but it could be a nice idea
Jason Byars
@jbyars
Nov 10 2016 22:02
ok, so aside from waiting for jobs to succeed or die, what was the idea for monitoring the cluster?
Paolo Di Tommaso
@pditommaso
Nov 10 2016 22:03
the AWS console should report the cpu usage for each instance
Jason Byars
@jbyars
Nov 10 2016 22:04
yes, but what about functionality details like whether a worker actually downloaded the docker image, etc.
Paolo Di Tommaso
@pditommaso
Nov 10 2016 22:06
nope, this information is not available
Jason Byars
@jbyars
Nov 10 2016 22:06
the good news so far is the only additional details I had to specify beyond the example were the security group and subnetId, but that seems to be par for the course for anything on AWS
Paolo Di Tommaso
@pditommaso
Nov 10 2016 22:06
maybe at some point there will be a NF console
Jason Byars
@jbyars
Nov 10 2016 22:06
ok, easy temp fix. After a pipeline runs or fails, does NF delete the container off the workers?
sorry I mean image
Paolo Di Tommaso
@pditommaso
Nov 10 2016 22:07
containers yes, images no
why do you need to delete the image? do u use many of them?
Jason Byars
@jbyars
Nov 10 2016 22:07
I don't need to, but if the image isn't there, then I know something is wrong.
Paolo Di Tommaso
@pditommaso
Nov 10 2016 22:09
well, if the pipeline fails you will see the details in the log file
Jason Byars
@jbyars
Nov 10 2016 22:12
yes, I get an error that the container is missing /bin/bash. I'm trying to make sure the container was retrieved properly before delving into that
Paolo Di Tommaso
@pditommaso
Nov 10 2016 22:13
ahh
you will need to use a different tag to force a new pull
also for this reason I prefer to use the sha256 image id
Jason Byars
@jbyars
Nov 10 2016 22:18
np, I'll just figure out what the sha256 is for this image... I'm testing with the broadinstitute/picard:latest image
Jason Byars
@jbyars
Nov 10 2016 22:32
super, it's a container with a weird entrypoint
Paolo Di Tommaso
@pditommaso
Nov 10 2016 22:32
the picard one ?
?
Jason Byars
@jbyars
Nov 10 2016 22:40
yes, it's making a little more sense now
when I see a .sh script entry point I assume bash is present somewhere in the container. Not necessarily true. I'll have to debug more later. Thanks!
Mike Smoot
@mes5k
Nov 10 2016 22:45
it may just be sh and not full bash
Paolo Di Tommaso
@pditommaso
Nov 10 2016 22:45
True