These are chat archives for nextflow-io/nextflow

17th
Jul 2015
Peter Amstutz
@tetron
Jul 17 2015 13:36
hi @pditommaso , Peter from CWL project here
Paolo Di Tommaso
@pditommaso
Jul 17 2015 13:37
Hi Peter, twitter is a bit uncomfortable for conversations ..
Peter Amstutz
@tetron
Jul 17 2015 13:38
yes
Paolo Di Tommaso
@pditommaso
Jul 17 2015 13:38
what's your point about task parallelism?
i've missed that
Peter Amstutz
@tetron
Jul 17 2015 13:41
so you have a data parallel operation to do task over data partitioned N ways
this is usually expressed as a "map" or "scatter/gather" operation
Paolo Di Tommaso
@pditommaso
Jul 17 2015 13:41
ok
Peter Amstutz
@tetron
Jul 17 2015 13:43
trying to understand how that's usually expressed in FBP
1 node that processes messages synchronously isn't parallel
10 nodes that do the same thing but on different parts of the data partition works, but isn't dynamic on the number of partitions
Paolo Di Tommaso
@pditommaso
Jul 17 2015 13:44
I don't know if this is defined rigorously by FBP
Samuel Lampa
@samuell
Jul 17 2015 13:45
There are somewhat different opinions about this in the FBP community
Paolo Di Tommaso
@pditommaso
Jul 17 2015 13:45
However FBP is a specialisation of the dataflow paradigm
Samuel Lampa
@samuell
Jul 17 2015 13:45
Not everybody sees the problem
Peter Amstutz
@tetron
Jul 17 2015 13:45
1 node could process messages asynchronously and spawn tasks, but that seems to require extensions to the model to reason about it
Samuel Lampa
@samuell
Jul 17 2015 13:46
But I think something that would make sense in my mind, is to let an FBP sub-network have the ability to spawn multiple workers inside itself, depending either on number of incoming packages, or a parameter
Paolo Di Tommaso
@pditommaso
Jul 17 2015 13:46
no wait, the dataflow model parallelisation is clearly defined
the node spawns parallel instances as soon as inputs are available
Samuel Lampa
@samuell
Jul 17 2015 13:47
In Luigi for example, we are using normal luigi tasks to contain networks, thus a kind of "sub-network". I have been blown away by how useful that is.
Complex dynamic network behaviour can thus be encapsulated in the convenient API of a normal task.
Thus, same API everywhere (inports, outports, and option ports), but enables custom dynamic network logic.
Peter Amstutz
@tetron
Jul 17 2015 13:48
@samuell yes you definitely want to have sub-networks/sub-workflows
@pditommaso ok so if the node spawns 10 processes how do you wait for them all to complete, say you have a merge step that's a barrier
Samuel Lampa
@samuell
Jul 17 2015 13:49
@tetron I guess you have something like that in CWL already? :)
Peter Amstutz
@tetron
Jul 17 2015 13:49
@samuell yes CWL includes subworkflows
Samuel Lampa
@samuell
Jul 17 2015 13:49
(I started reading up on stuff last night, but didn't get too far yet :) )
@tetron: :thumbsup:
Paolo Di Tommaso
@pditommaso
Jul 17 2015 13:49
@tetron It's not the role of the node to wait for completion
What I'm doing in nextflow is having another node specialised for that
Peter Amstutz
@tetron
Jul 17 2015 13:50
so how do you implement a merge step?
so you do need to extend the model
Samuel Lampa
@samuell
Jul 17 2015 13:50
@tetron But I think a general theme about FBP vs Workflows is also that a lot of things that are left up to the process implementation in FBP, one would like to define declaratively in a workflow language like CWL.
Peter Amstutz
@tetron
Jul 17 2015 13:51
@samuell yes that's right
Paolo Di Tommaso
@pditommaso
Jul 17 2015 13:52
@tetron If you are referring to FBP you can consider it an extension, but from the point of view of the dataflow paradigm it is not an extension
@tetron here there's an easy example
this line does the scatter
Peter Amstutz
@tetron
Jul 17 2015 13:53
well since we're trying to write a specification I'm looking for rigorously defined models and not "paradigms" :-)
then this process executes in parallel
finally it collects the results to a file
@tetron I see your point, but there is no unique formal definition of dataflow
however this paper gives a great overview of many different approaches
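The three steps described above (scatter, parallel execution, collecting the results) can be sketched outside Nextflow. What follows is only an illustrative Python analogy, not the linked Nextflow example: `split_fasta`, `sequence_length` and `scatter_gather` are hypothetical names, and a thread pool stands in for the parallel process instances.

```python
from concurrent.futures import ThreadPoolExecutor

def split_fasta(text):
    """Scatter: yield one FASTA record (header plus sequence lines) at a time."""
    record = []
    for line in text.splitlines():
        if line.startswith(">") and record:
            yield "\n".join(record)
            record = []
        if line.strip():
            record.append(line)
    if record:
        yield "\n".join(record)

def sequence_length(record):
    """Per-record task (hypothetical): return (record id, number of residues)."""
    header, *seq_lines = record.splitlines()
    return header[1:].split()[0], sum(len(s) for s in seq_lines)

def scatter_gather(text, workers=4):
    """Run the task on every record in parallel, then gather into one report."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The number of records is discovered at run time, not fixed in advance.
        results = list(pool.map(sequence_length, split_fasta(text)))
    return "\n".join(f"{name}\t{length}" for name, length in results)

FASTA = ">seq1 first\nACGT\nAC\n>seq2\nGGG\n"
print(scatter_gather(FASTA))
```

The point of the sketch is that the degree of parallelism follows the number of records produced by the split, which is the "dynamic on the number of partitions" property asked about earlier.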
Peter Amstutz
@tetron
Jul 17 2015 13:57
in that example where is "splitFasta" defined?
Paolo Di Tommaso
@pditommaso
Jul 17 2015 13:57
it's an operator provided by the framework. See here
Peter Amstutz
@tetron
Jul 17 2015 13:58
isn't that cheating? :-)
Paolo Di Tommaso
@pditommaso
Jul 17 2015 13:58
why? :)
Samuel Lampa
@samuell
Jul 17 2015 13:59
@tetron Well, possibly starting to see the "problem" with a general scatter/gather syntax now ...
Peter Amstutz
@tetron
Jul 17 2015 13:59
can you write splitFasta using just the capabilities of the nextflow language or does it go outside the framework?
Samuel Lampa
@samuell
Jul 17 2015 13:59
@tetron It is never a problem just to do a scatter/gather. But the main problem is to define on what data to do it, right?
Peter Amstutz
@tetron
Jul 17 2015 13:59
is it a "library" function or a "builtin" function
Samuel Lampa
@samuell
Jul 17 2015 14:00
@tetron That is, how many data packets, on which ports, to await before scattering out, no?
Paolo Di Tommaso
@pditommaso
Jul 17 2015 14:00
it's a built-in function, but you can provide your own library
Peter Amstutz
@tetron
Jul 17 2015 14:01
@samuell that's right, splitting is very data specific, might have 1 port that's reference data and 1 port that's the actual input data
that's the A, B example I gave on twitter
Samuel Lampa
@samuell
Jul 17 2015 14:01
@tetron Yes
Peter Amstutz
@tetron
Jul 17 2015 14:01
could repeat both A and B 10 times with same value of A and each value of B
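The A/B case mentioned here (one reference value A, many input values B) can be illustrated with a small Python sketch. The function name `pair_reference` and the sample values are hypothetical; the sketch only shows the pairing, not how any particular workflow engine expresses it.

```python
from itertools import repeat

def pair_reference(reference, partitions):
    """Pair the single reference value A with every partition value B."""
    return list(zip(repeat(reference), partitions))

# One (A, B) pair per partition; each pair could then be handed to a separate task.
pairs = pair_reference("reference.fa", ["chunk1", "chunk2", "chunk3"])
for a, b in pairs:
    print(a, b)
```

Only B is scattered; A is repeated unchanged alongside it, so the task always receives both inputs.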
Samuel Lampa
@samuell
Jul 17 2015 14:01
@tetron And then of course, whether you need a synchronized scatter/gather, or if a "fire on first packet arrival, don't care about order" is enough, and similar ...
@tetron An interesting concept in FBP related to that, is sub-streams. It might not be applicable in the workflow case though, but they have specialized "start substream" and "close substream" packages, that are sent together with the normal packages in the (ordered) packet stream.
... and use that to keep track of "compound sets of data packets" for various purposes.
But as said, not maybe super-applicable here :)
/me BRB, need more coffee
Peter Amstutz
@tetron
Jul 17 2015 14:06
@pditommaso does "collectFile" just append results to a file?
Paolo Di Tommaso
@pditommaso
Jul 17 2015 14:06
yes, it's specialised to merge to a file or to many files (given a key)
however the important thing is that these operators work in exactly the same way as task nodes
in the sense that they are asynchronous
Peter Amstutz
@tetron
Jul 17 2015 14:08
@pditommaso so is that a barrier operation (executes once) or does it execute for each message?
Samuel Lampa
@samuell
Jul 17 2015 14:08
@tetron But for things like the reference genome file, what about solving it with a specialized task type, that always returns the same file? (Idea from Luigi, which has a special "ExternalTask" class, that just returns a file without doing anything).
Paolo Di Tommaso
@pditommaso
Jul 17 2015 14:08
for each message, for this reason I'm saying that they work as a task node
@tetron Currently the main difference is that a process/task can spawn multiple instances
whereas operators are sequential
Peter Amstutz
@tetron
Jul 17 2015 14:11
@pditommaso okay, that makes sense, but then how would you implement a BAM merge where you can't do anything until you have all the files?
Peter Amstutz
@tetron
Jul 17 2015 14:13
@samuell yes that works but doesn't really capture the user's intent
Paolo Di Tommaso
@pditommaso
Jul 17 2015 14:13
@tetron currently we do that by collecting all the BAM files with an operator, then passing that list as a single packet to a task that runs samtools
Peter Amstutz
@tetron
Jul 17 2015 14:14
@pditommaso ok, so how does the operator know how much data to collect? :-)
Paolo Di Tommaso
@pditommaso
Jul 17 2015 14:16
@tetron nice question! it collects until the channel terminates, i.e. emits a special signal that stops the operator and then the whole network
Samuel Lampa
@samuell
Jul 17 2015 14:16
@pditommaso Sorry to interrupt the discussion, but just wondering: So, could one say that an operator is a specialized task type that has extra strong support in the DSL? (Why I'm asking is because what I find so nice with dataflow is that you can always solve everything with another specialized "wrapper" process ... such as one that collects data, converts it, sorts it, or whatever, to fit into the next process :) )
Paolo Di Tommaso
@pditommaso
Jul 17 2015 14:17
@samuell definitely
Peter Amstutz
@tetron
Jul 17 2015 14:17
@pditommaso ahhhhh, so there's an explicit "channel close"
@pditommaso that's the piece I was missing
Paolo Di Tommaso
@pditommaso
Jul 17 2015 14:18
@tetron not in the sense that you have to invoke a close operation
@tetron but yes, when there's no more data (for example the splitFasta operator will emit a stop at the end of the splitting)
@samuell that is exactly the point, nextflow uses the dataflow programming model to handle task interactions and provides a set of operators, i.e. specialised tasks for recurrent operations or channel manipulation
Peter Amstutz
@tetron
Jul 17 2015 14:23
@pditommaso I see so when you get "out of data" you get a channel close msg which propagates downstream as each node finishes its work
@pditommaso that's elegant
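The termination behaviour summarised here can be sketched with plain Python queues and threads. This is an assumption-laden analogy, not Nextflow's implementation: `STOP`, `source`, `map_node`, `collect_node` and `run_pipeline` are hypothetical names. It shows a stop signal travelling downstream, and a collecting node using it as the barrier for a merge step.

```python
import queue
import threading

STOP = object()  # hypothetical end-of-channel marker

def source(items, out):
    """Emit every item, then the stop signal once there is no more data."""
    for item in items:
        out.put(item)
    out.put(STOP)

def map_node(func, inp, out):
    """Handle messages one by one; forward the stop signal when it arrives."""
    while True:
        msg = inp.get()
        if msg is STOP:
            out.put(STOP)
            return
        out.put(func(msg))

def collect_node(inp, out):
    """Barrier: buffer all messages, emit them as one list when the channel ends."""
    buffered = []
    while True:
        msg = inp.get()
        if msg is STOP:
            out.put(buffered)
            out.put(STOP)
            return
        buffered.append(msg)

def run_pipeline(items):
    a, b, c = queue.Queue(), queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=source, args=(items, a)),
        threading.Thread(target=map_node, args=(lambda x: x * x, a, b)),
        threading.Thread(target=collect_node, args=(b, c)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return c.get()  # the single merged packet

print(run_pipeline([1, 2, 3]))  # prints [1, 4, 9]
```

No node needs to know the number of messages in advance: the collecting node simply waits for the stop signal, and if the source never emitted it the pipeline would keep running.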
Paolo Di Tommaso
@pditommaso
Jul 17 2015 14:25
@tetron yes, exactly. In principle a pipeline could run forever if a termination is never emitted
that means that I need to change my presentation of nextflow for the next talk ;)
Peter Amstutz
@tetron
Jul 17 2015 14:27
@pditommaso I should apologize I was in the audience for your talk but not paying attention, I was hacking on a Galaxy tool to CWL converter ;-)
@pditommaso would really like not only for CWL to run on Nextflow but at least consider if Nextflow could be a DSL that compiles to CWL
Paolo Di Tommaso
@pditommaso
Jul 17 2015 14:29
@tetron no problem at all, indeed it's pretty hard to explain in a few minutes
@tetron that would be really cool, but it looks to me very hard
I would like to have a kind of import tool for CWL in Nextflow
but compiling nextflow to CWL looks very challenging to me
Peter Amstutz
@tetron
Jul 17 2015 14:32
that's right, CWL in the current form probably wouldn't work, but there's interest in extending it to accommodate flows
@pditommaso @samuell ok have to go and do work now, but thanks for the conversation, very enlightening
Samuel Lampa
@samuell
Jul 17 2015 14:34
@tetron @pditommaso Thank you too! Same here, learning much from these discussions!
Paolo Di Tommaso
@pditommaso
Jul 17 2015 14:34
thanks a lot to both of you
it has been a pleasure chatting with you.