These are chat archives for nextflow-io/nextflow

8th
Mar 2017
Tiffany Delhomme
@tdelhomme
Mar 08 2017 10:00
Hi all, does anybody know if I can run Nextflow using one of its commit IDs?
Evan Floden
@evanfloden
Mar 08 2017 10:01
I am pretty sure you can:
nextflow run nextflow-io/hello -r $COMMIT_ID
Assuming you are referring to the commit ID of an NF project on GitHub
Tiffany Delhomme
@tdelhomme
Mar 08 2017 10:03
but that would run the hello pipeline at $COMMIT_ID, not Nextflow itself...
actually I want to run the latest commit of Nextflow
I guess if it's not released I can't
Paolo Di Tommaso
@pditommaso
Mar 08 2017 10:06
actually I want to run the latest commit of Nextflow
what do you mean ?
Evan Floden
@evanfloden
Mar 08 2017 10:07
I think like NXF_VER=$latest_commit
Is that what you meant?
Paolo Di Tommaso
@pditommaso
Mar 08 2017 10:07
version or commit??
Tiffany Delhomme
@tdelhomme
Mar 08 2017 10:11
commit. I want to use the new collect() operator (nextflow-io/nextflow@d978d5d); I will try NXF_VER=$latest_commit, but I guessed this only works for released versions
Evan Floden
@evanfloden
Mar 08 2017 10:12
Yeah, I'm guessing that is what you wanted. Paolo will let you know :wink:
Paolo Di Tommaso
@pditommaso
Mar 08 2017 10:12
ok, you are talking about the version not the commit :)
first run this to download the latest snapshot
NXF_VER=0.24.0-SNAPSHOT CAPSULE_RESET=1 nextflow info
then run NF as shown below
NXF_VER=0.24.0-SNAPSHOT nextflow run .. etc
Tiffany Delhomme
@tdelhomme
Mar 08 2017 10:15
hmm, ok, thanks, I will do that... but just one question: why can I use Nextflow version 0.24.0 if it's not on GitHub?
Paolo Di Tommaso
@pditommaso
Mar 08 2017 10:16
because it's an unreleased version only for testing
Tiffany Delhomme
@tdelhomme
Mar 08 2017 10:19
I guessed a version necessarily had to be released... thanks a lot! :+1:
Paolo Di Tommaso
@pditommaso
Mar 08 2017 10:20
thanks to you, I'm also interested in your feedback !
Tiffany Delhomme
@tdelhomme
Mar 08 2017 12:39

needlestack does not work anymore with this snapshot, I get this:

ERROR ~ Error executing process > 'split_bed'

Caused by:
  A DataflowVariable can only be assigned once. Only re-assignments to an equal value are allowed.

Related to this small process...

This appears only if I have several *_regions files in:
file '*_regions' into split_bed mode flatten
Paolo Di Tommaso
@pditommaso
Mar 08 2017 13:22
oops, let me check
Paolo Di Tommaso
@pditommaso
Mar 08 2017 13:32
@tdelhomme I've uploaded a new version, do as before
NXF_VER=0.24.0-SNAPSHOT CAPSULE_RESET=1 nextflow info
then run NF as shown below
NXF_VER=0.24.0-SNAPSHOT nextflow run .. etc
Tiffany Delhomme
@tdelhomme
Mar 08 2017 13:41
That's ok! awesome :smile: I will check collect() usage
Paolo Di Tommaso
@pditommaso
Mar 08 2017 13:41
great
Maxime Garcia
@MaxUlysse
Mar 08 2017 13:44
From what I understand, collect() should replace toList(), right ?
Paolo Di Tommaso
@pditommaso
Mar 08 2017 13:44
yes
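For example, a minimal sketch of the new operator:

Channel.from(1, 2, 3)
       .collect()
       .println()   // emits a single item: [1, 2, 3]

// collect also accepts an optional mapping closure, e.g. .collect { it * 2 }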
Tiffany Delhomme
@tdelhomme
Mar 08 2017 13:50
Ok no problem with the new collect() :+1:
Stian Soiland-Reyes
@stain
Mar 08 2017 14:36
@heuermh sounds quite interesting with a Nextflow/CWL/Spark/ADAM route! I don't think long-running services have been addressed.. or even in scope. Do you mean like in BPM/BPEL, say waiting for a user or external event to continue the workflow?
Paolo Di Tommaso
@pditommaso
Mar 08 2017 14:37
we are moving towards a dreamflow :)
Stian Soiland-Reyes
@stain
Mar 08 2017 14:38
I guess asynchronous services would be in scope.. that fits well into a push model. Nextflow's channels can be used like that?
say where a result can appear out of "thin air" and trigger stuff further down
Paolo Di Tommaso
@pditommaso
Mar 08 2017 14:39
I don't think so, moreover IMO workflow definition should not take care of setting up the infra
Stian Soiland-Reyes
@stain
Mar 08 2017 14:40
I agree on that
we've seen lots of abuse of "workflows" that did that kind of thing in the past
a workflow that sets up itself.. it's no longer flowing!
Paolo Di Tommaso
@pditommaso
Mar 08 2017 14:40
exactly
Stian Soiland-Reyes
@stain
Mar 08 2017 14:44
it's good to stay focused on the domain problems, and not overbake everything into the same system just because one COULD. One project is looking at BPMN/BPEL-like outer workflows, which can include things like "Human to check the sample quality" and a CWL/Nextflow-like analytical workflow -- when it is finished, the outer "business" (aka research) workflow continues.
Paolo Di Tommaso
@pditommaso
Mar 08 2017 14:46
I agree on this separation of concerns
Changing topic: I've just uploaded a new snapshot 0.24.0-SNAPSHOT including a small enhancement to nextflow DSL that's very handy in some cases
Félix C. Morency
@fmorency
Mar 08 2017 14:53
Nice @pditommaso
Michael L Heuer
@heuermh
Mar 08 2017 15:51

@stain We have a pipeline written in Toil here
https://github.com/BD2KGenomics/toil-scripts/blob/master/src/toil_scripts/adam_gatk_pipeline/align_and_call.py

Its runtime is dominated by long BWA and GATK HaplotypeCaller steps. In-between we use Toil Services to bring up Apache Spark and Hadoop HDFS services. Then we run our ADAM transformation jobs and tear the services down. Instantiating a large number of nodes for the Spark cluster at the beginning before BWA runs and waiting to tear it down until after HaplotypeCaller runs could be expensive when running on cloud services.

Michael L Heuer
@heuermh
Mar 08 2017 16:00
That is kind of our old way of thinking though; as mentioned in the presentation above, a Spark-only variant calling pipeline with BWA-on-Spark, ADAM, and a variant caller on Spark runs much faster and is easier to deploy.
What we would like to work towards in CWL is a way to make it easy to integrate our stuff (and Spark stuff more generally) in traditional pipelines built by the community.
Stian Soiland-Reyes
@stain
Mar 08 2017 16:41
Yes, that is a good example of a more complex cloud pipeline.. and so I understand you only need the Spark/HDFS services for parts of the overall workflow?
but ideally they should be ready just before those steps start
Félix C. Morency
@fmorency
Mar 08 2017 16:48
how do I set a variable if a channel is empty?
Stian Soiland-Reyes
@stain
Mar 08 2017 16:48
so could we think of it as a kind of service requirement for a subworkflow? You just need "a Spark service" spun up somehow before that part can run - after that it is no longer needed. The workflow engine would need to know (or be configured) for ways to fulfil that requirement, but that doesn't have to be part of the workflow definition; although a suggestion for how to do it could accompany it. It's not very different from requiring a certain Docker image.
Félix C. Morency
@fmorency
Mar 08 2017 16:50
setting the variable in the ifEmpty closure doesn't work
Paolo Di Tommaso
@pditommaso
Mar 08 2017 16:51
do you mean how to write a conditional statement?
Félix C. Morency
@fmorency
Mar 08 2017 16:51
yes, based on if a channel is empty or not
Paolo Di Tommaso
@pditommaso
Mar 08 2017 16:52
if ( channel.isBound() ) {  .. }
Félix C. Morency
@fmorency
Mar 08 2017 16:53
Oooh! Is this in the doc somewhere?
Paolo Di Tommaso
@pditommaso
Mar 08 2017 16:53
but it looks weird, you should not have such a test in your flow
Félix C. Morency
@fmorency
Mar 08 2017 16:54
I have an optional input file/process in the middle of my flow
If the input file is not given on the command line and/or not detected by .fromPath, I need to run an additional process that will output said missing file
Paolo Di Tommaso
@pditommaso
Mar 08 2017 16:56
have you tried having a process with a when condition ?
umm
even easier
your_channel = params.your_file ? Channel.fromPath(params.your_file) : Channel.empty()

process foo {
  input: file x from your_channel
..
}
then foo is executed only when the input is provided
Félix C. Morency
@fmorency
Mar 08 2017 17:00
In my case, I need to execute foo when the input is not provided
and foo will produce said not-provided input
Paolo Di Tommaso
@pditommaso
Mar 08 2017 17:01
reverse, but same logic
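i.e. something like this minimal sketch (the file, process, and channel names are hypothetical):

your_channel = params.your_file ? Channel.empty() : Channel.from('do_it')

process foo {
  input:
  val flag from your_channel

  output:
  file 'your_file' into made_file_ch

  """
  # create the missing file here instead of receiving it
  touch your_file
  """
}

then foo runs only when --your_file is NOT provided, and made_file_ch supplies the file downstream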
Félix C. Morency
@fmorency
Mar 08 2017 17:01
yea
thanks
Paolo Di Tommaso
@pditommaso
Mar 08 2017 17:02
:+1:
Tim Diels
@timdiels
Mar 08 2017 17:18
"Nextflow only requires that the main script in your pipeline project is called main.nf"
Can you have other scripts then? There is no include directive, is there?
Paolo Di Tommaso
@pditommaso
Mar 08 2017 17:19
you can name it as you like; the only benefit of using main.nf is that you can run it directly from a remote repo such as GitHub
if you use a different name, you can specify it in the config file by using mainScript setting
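e.g. a minimal nextflow.config sketch (assuming the manifest config scope):

manifest {
    mainScript = 'my-script.nf'
}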
no, no include directive
Tim Diels
@timdiels
Mar 08 2017 17:23
Would evaluate(new File("something.nf")) work as an include? Or perhaps when it's named "something.groovy"?
Paolo Di Tommaso
@pditommaso
Mar 08 2017 17:25
there's no such evaluate function, or is this a feature request? :)
Tim Diels
@timdiels
Mar 08 2017 17:30
I assumed nf was an extension of Groovy files, and Groovy apparently has evaluate (I don't know Groovy)
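(in plain Groovy the method does exist, e.g. this sketch:)

// plain Groovy, not Nextflow DSL: runs the file as a Groovy script
def result = evaluate(new File('something.groovy'))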
Paolo Di Tommaso
@pditommaso
Mar 08 2017 17:31
oops, you are right, I had forgotten that :)
but that would just execute a plain Groovy script, it won't help much
I guess you want to include subworkflows
Tim Diels
@timdiels
Mar 08 2017 17:34
Yes. If I were to run it as a sub-workflow, I would make sub-optimal use of the cores available on the cluster, as I would like Nextflow to restrain the number of SGE cores it uses, for which there's a config option.
I expect that the sub-workflow and the main workflow do not communicate with each other about the max number of cores to use, so I would have to allocate cores to the sub-workflow from the main workflow, whether it uses all of them or not.
Paolo Di Tommaso
@pditommaso
Mar 08 2017 17:37
this is one of the reasons why implementing a subworkflow mechanism is quite challenging; we have preferred to postpone this feature for now
Tim Diels
@timdiels
Mar 08 2017 17:38
So, alternatively, I would include the subworkflow into the main workflow via an include statement and then the cores limit is shared.
Paolo Di Tommaso
@pditommaso
Mar 08 2017 17:39
what include statement ?
Tim Diels
@timdiels
Mar 08 2017 17:39
Feature request? :P
Paolo Di Tommaso
@pditommaso
Mar 08 2017 17:39
ahah
we are thinking about something like that, but there's no defined schedule for it
Tim Diels
@timdiels
Mar 08 2017 17:42
Ok, please give it my +1 then, thanks
Paolo Di Tommaso
@pditommaso
Mar 08 2017 17:42
moreover I think that most of the time this is a requirement based on experience with general-purpose programming languages
Tim Diels
@timdiels
Mar 08 2017 17:43
I admit I've only just started converting our current pipeline to Nextflow
Paolo Di Tommaso
@pditommaso
Mar 08 2017 17:44
I've found very few cases in which a subworkflow was needed
Félix C. Morency
@fmorency
Mar 08 2017 17:57
    when:
    in_brain_mask.isBound() == true
is there a reason for ^ not to work?
Paolo Di Tommaso
@pditommaso
Mar 08 2017 17:57
nope
I mean, that's not the way it's supposed to work, neither when nor isBound
use your params in the when clause
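e.g. a minimal sketch (the flag, command, and channel names are hypothetical):

process makeMask {
  input:
  file x from in_ch

  when:
  !params.brain_mask   // hypothetical flag: run only when no mask was given

  """
  compute_mask $x    # hypothetical command
  """
}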
Félix C. Morency
@fmorency
Mar 08 2017 18:00
it's not a param, it's a Channel.fromPath('**/*/stuff')
or am I missing something?
Paolo Di Tommaso
@pditommaso
Mar 08 2017 18:01
well, isBound is a low-level API, I don't think it can be used in the context of a process
Félix C. Morency
@fmorency
Mar 08 2017 18:05
I am scanning a root folder for input files and if brain_mask is detected, I'm skipping a task. If the channel is empty, I'm running an additional process
The only script parameter I have is --input
Paolo Di Tommaso
@pditommaso
Mar 08 2017 18:08
what about something like this
Paolo Di Tommaso
@pditommaso
Mar 08 2017 18:14
// unbound value channel acting as a manual trigger
Channel.value().set { trigger }
// when no file matches the glob, fire the trigger
Channel.fromPath('**/*/stuff').ifEmpty { trigger << 'do_it' }.set { ch1 }

process foo {
  input: val x from trigger
  '''
  optional code
  '''
}
Félix C. Morency
@fmorency
Mar 08 2017 18:18
mmm
I'll play with that
Félix C. Morency
@fmorency
Mar 08 2017 19:26
The foo process takes input from other processes. The flow simply blocks when using ^
Paolo Di Tommaso
@pditommaso
Mar 08 2017 19:29
it should not stop
Félix C. Morency
@fmorency
Mar 08 2017 19:31
it blocks between foo and the downstream task
Paolo Di Tommaso
@pditommaso
Mar 08 2017 19:32
try to isolate the use case in a smaller script
Félix C. Morency
@fmorency
Mar 08 2017 19:32
sure, sec
Tim Diels
@timdiels
Mar 08 2017 19:42
Is it safe for 2 nextflow runs to share the same work directory?
Paolo Di Tommaso
@pditommaso
Mar 08 2017 19:42
yes
Tim Diels
@timdiels
Mar 08 2017 19:43
Nice
cd to Result_WithFoo and run nextflow run ../trigger.nf --input ../Data/WithFoo/
The flow will block after the Bar task
Everything is okay if you cd to Result_WithoutFoo and run nextflow run ../trigger.nf --input ../Data/WithoutFoo/
Paolo Di Tommaso
@pditommaso
Mar 08 2017 20:04
umm
you are right, there's a problem but that's suggesting a nice extension to ifEmpty
Félix C. Morency
@fmorency
Mar 08 2017 20:06
can my use-case be supported one way or another using the current release of nextflow?
Paolo Di Tommaso
@pditommaso
Mar 08 2017 20:06
let me think a bit
Paolo Di Tommaso
@pditommaso
Mar 08 2017 20:14
replace
Channel.value().set{ trigger }
in_foo = Channel
    .fromPath("$input/**/foo")
    .map{ ch1 -> [ch1.parent.name, ch1] }
    .ifEmpty{ trigger << "do_it" }
    .set{ ch1 }
with
// emit 'do_it' only when no matching file exists
trigger = file("$input/**/foo") ? Channel.empty() : Channel.value('do_it')

in_foo = Channel
    .fromPath("$input/**/foo")
    .map{ ch1 -> [ch1.parent.name, ch1] }
    .set{ ch1 }
it works by doing so
then I would surely remove

if(foo_for_Other.isBound()) {
    in_foo.into{foo_for_Other}
}
Félix C. Morency
@fmorency
Mar 08 2017 20:17
what's the clean way of using the output from the process instead of the (empty) input and vice versa?
Paolo Di Tommaso
@pditommaso
Mar 08 2017 20:18
are you referring to the last snippet?
Félix C. Morency
@fmorency
Mar 08 2017 20:19
yes
WithFoo should execute both the Bar and Other processes
WithoutFoo should execute all three processes
amacbride
@amacbride
Mar 08 2017 20:21

This is sort of an odd question, but I'm doing some cleanup work, and was wondering if it was possible to have a process create and pass a channel on to subsequent processes? (Rather than having it in the main body of the script.)

I currently have a fairly complicated channel declaration (with a long series of regexes, maps, groupTuples, etc.) in the script body, and would love to wrap it as a process.

Would I essentially do it as,

output:
      Channel.fromPath( channelspec ).toSortedList(etc.) into mychannel

...or something similar?

Paolo Di Tommaso
@pditommaso
Mar 08 2017 20:25
what do you mean? are you trying to create a library ?
amacbride
@amacbride
Mar 08 2017 20:28
No, just trying to get most anything with any complexity into a process, for tidiness and trackability. (For example, the DAG generation leaves a bunch of random text on the diagram when there are things outside processes, it's more uniform in the tracing output, etc.)
Mike Smoot
@mes5k
Mar 08 2017 20:28
@pditommaso since it's come up twice today, I'll remind you about: nextflow-io/nextflow#238 :) I believe that approach would solve both @amacbride and @timdiels use cases...
amacbride
@amacbride
Mar 08 2017 20:28
But yes, it would be useful as a modularity exercise as well.
Paolo Di Tommaso
@pditommaso
Mar 08 2017 20:29
Thanks Mike ! :)
however that looks more like a chain of operators; in that case you can easily move it into a separate library file
Mike Smoot
@mes5k
Mar 08 2017 20:32
I'd love to hear your feedback on the idea if only to know if you see any insurmountable problems with it.
Paolo Di Tommaso
@pditommaso
Mar 08 2017 20:33
It should be feasible, however I would not like to introduce the concept of a module
ideally any workflow should be able to be incorporated by another workflow script
Mike Smoot
@mes5k
Mar 08 2017 20:36
So you'd like any main.nf script to be seamlessly incorporated into another main.nf script?
Paolo Di Tommaso
@pditommaso
Mar 08 2017 20:37
yes
amacbride
@amacbride
Mar 08 2017 20:37
I think I'm asking a slightly different question: is it possible to declare a channel using a process?
Paolo Di Tommaso
@pditommaso
Mar 08 2017 20:37
a process creates a channel by definition, but not in the way you are suggesting.
Mike Smoot
@mes5k
Mar 08 2017 20:38
@pditommaso I'd like that too!!! @amacbride sorry for hijacking your question...
amacbride
@amacbride
Mar 08 2017 20:39
Hrm. OK. So a process can only put things into a channel, not define it (other than implicitly?)
Paolo Di Tommaso
@pditommaso
Mar 08 2017 20:39
@mes5k we need to find a way to define the workflow input/output in a consistent way
@amacbride I'm not clearly understanding what you are trying to do
do you want to have a custom rule on the output channel creation ?
amacbride
@amacbride
Mar 08 2017 20:44

No, I want the output channel of a process to be the channel I define with (brace for ugliness)

Channel.fromPath( channelspec )
              .toSortedList()
              .flatMap()
              .map { path -> [((path.baseName =~ /([-\w]*)_(\w*)_(\w*)_(\w*)_(\w*)/)[0][2]) + '.' + ((path.baseName =~ /([-\w]*)_(\w*)_(\w*)_(\w*)_(\w*)/)[0][3])[-1..-1], path ] }
              .groupTuple()
              .map { k,v -> ['sample_name': ((v[0].baseName =~ /([-\w]*)_(\w*)_(\w*)_(\w*)_(\w*)/)[0][1]),
                              'sample_id': ((v[0].baseName =~ /([-\w]*)_(\w*)_(\w*)_(\w*)_(\w*)/)[0][2]),
                              'lane_id': 'L' + ((v[0].baseName =~ /([-\w]*)_(\w*)_(\w*)_(\w*)_(\w*)/)[0][3])[-1..-1],
                              'read1_path': v[0],
                              'read2_path': v[1]] }

I'm currently creating this in the main body of the script and setting it to a channel variable ("fastqs") that I then pass into my alignment process.

I'd rather have a process getFastqs { } that has fastqs as its output parameter, so that everything is nice and clean.

(or at least less hideous)
Félix C. Morency
@fmorency
Mar 08 2017 20:45
I mixed the empty and non-empty channels using .mix() and it worked
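In other words, something along these lines (a rough sketch; file, process, and channel names are hypothetical):

// masks already on disk (this channel may be empty)
existing_ch = Channel.fromPath("$params.input/**/brain_mask")

// fire the fallback process only when nothing was found on disk
trigger_ch = file("$params.input/**/brain_mask") ? Channel.empty() : Channel.value('do_it')

process makeMask {
  input:
  val flag from trigger_ch

  output:
  file 'brain_mask' into generated_ch

  """
  touch brain_mask   # placeholder for the real mask computation
  """
}

// downstream processes consume the union of both sources
mask_ch = existing_ch.mix(generated_ch)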
Paolo Di Tommaso
@pditommaso
Mar 08 2017 20:45
ok, but the main source of that channel is not the files produced by that process?
Mike Smoot
@mes5k
Mar 08 2017 20:45
agreed - that's what I was going for with the idea of input_channels and output_channels, but that clearly misses all the IO with the outside world, which is to say params, creating the initial channels, etc.
Paolo Di Tommaso
@pditommaso
Mar 08 2017 20:46
@fmorency fantastic idea!
@mes5k one option is to add two new input/output scopes to the workflow object
but at the same time I would keep the ability to define the params?
amacbride
@amacbride
Mar 08 2017 20:48
@pditommaso Not sure what you mean. channelspec is a string representing a Unix path, which then goes through this declaration to generate a channel with 5 members parsed from the filenames themselves (sample_name, sample_id, lane_id, read1_path, read2_path)
Paolo Di Tommaso
@pditommaso
Mar 08 2017 20:50
what do you mean by channelspec in that snippet
just a path ?
amacbride
@amacbride
Mar 08 2017 20:50
Yes.
Félix C. Morency
@fmorency
Mar 08 2017 20:50
@pditommaso thanks a lot for your time
Paolo Di Tommaso
@pditommaso
Mar 08 2017 20:50
welcome
amacbride
@amacbride
Mar 08 2017 20:51
It boils down to "/nfs/some/long/path/*.fastq.gz"
Paolo Di Tommaso
@pditommaso
Mar 08 2017 20:52
what if you do something like
def getFastqs( channelspec ) {

Channel.fromPath( channelspec )
              .toSortedList()
              .flatMap()
              .map { path -> [((path.baseName =~ /([-\w]*)_(\w*)_(\w*)_(\w*)_(\w*)/)[0][2]) + '.' + ((path.baseName =~ /([-\w]*)_(\w*)_(\w*)_(\w*)_(\w*)/)[0][3])[-1..-1], path ] }
              .groupTuple()
              .map { k,v -> ['sample_name': ((v[0].baseName =~ /([-\w]*)_(\w*)_(\w*)_(\w*)_(\w*)/)[0][1]),
                              'sample_id': ((v[0].baseName =~ /([-\w]*)_(\w*)_(\w*)_(\w*)_(\w*)/)[0][2]),
                              'lane_id': 'L' + ((v[0].baseName =~ /([-\w]*)_(\w*)_(\w*)_(\w*)_(\w*)/)[0][3])[-1..-1],
                              'read1_path': v[0],
                              'read2_path': v[1]] }

}
amacbride
@amacbride
Mar 08 2017 20:53

And what I get out of it are tuples like:

[sample_name:A11, sample_id:S1, lane_id:L1, read1_path:/nfs/A11_Fusion-001_S1_L001_R1_001.fastq.gz, read2_path:/nfs/n/A11_Fusion-001_S1_L001_R2_001.fastq.gz]

Paolo Di Tommaso
@pditommaso
Mar 08 2017 20:53
?
you can use it as
amacbride
@amacbride
Mar 08 2017 20:53
Ah, so wrap it as a Groovy function instead of a NF process?
Paolo Di Tommaso
@pditommaso
Mar 08 2017 20:54
getFastqs("/nfs/some/long/path/*.fastq.gz").set { your_channel_name }
exactly
that function returns the channel object that you can use as you need
amacbride
@amacbride
Mar 08 2017 20:55
I'll give it a try. I suspect it won't actually do what I want (which is to have this look nice on the DAG output and show up in the trace file) but it's at least cleaner.
Paolo Di Tommaso
@pditommaso
Mar 08 2017 20:55
ahhhhh
nope
that would be a nice extension
that I would call custom operators
amacbride
@amacbride
Mar 08 2017 21:00
I will let you know how it goes. Thanks!
Paolo Di Tommaso
@pditommaso
Mar 08 2017 21:00
:+1:
Mike Smoot
@mes5k
Mar 08 2017 21:08
@pditommaso I'm not sure I understand enough of the codebase (yet!) to fully understand what you mean by input and output scopes of workflow, but it seems like we're thinking in a similar direction. And I completely agree that params needs to be kept.
Paolo Di Tommaso
@pditommaso
Mar 08 2017 21:10
I mean
workflow.input { x; y; z }
workflow.output{ etc .. }
in this context
Mike Smoot
@mes5k
Mar 08 2017 21:13
Where is the workflow object defined in the source tree?
Mike Smoot
@mes5k
Mar 08 2017 21:16
Cool, I'll ponder that code for a while to make sure I understand how it gets created, populated, etc.
I think I really need to understand more about groovy DSLs...
Félix C. Morency
@fmorency
Mar 08 2017 21:24
is there a way to output the same file to multiple publishDir?
Paolo Di Tommaso
@pditommaso
Mar 08 2017 21:57
Nope
Félix C. Morency
@fmorency
Mar 08 2017 21:59
Ok. Made a noop process that does what I need
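Something along these lines (a sketch of that pass-through idea; the directory, process, and channel names are hypothetical; the includeInputs output option lets the glob re-emit the staged input file):

process copyToSecondDir {
  publishDir '/path/to/second/dir', mode: 'copy'

  input:
  file f from result_ch

  output:
  file '*' includeInputs true into copied_ch

  // no real work is done; the staged input is re-emitted so publishDir copies it
  """
  true
  """
}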