These are chat archives for nextflow-io/nextflow

24th
Jul 2018
LukeGoodsell
@LukeGoodsell
Jul 24 2018 12:05 UTC
Hi, is it possible to enable docker for a workflow from within the script, without either passing a command line parameter or creating a nextflow.config file?
Paolo Di Tommaso
@pditommaso
Jul 24 2018 12:05 UTC
nope
LukeGoodsell
@LukeGoodsell
Jul 24 2018 12:05 UTC
Ok, thanks
Paolo Di Tommaso
@pditommaso
Jul 24 2018 12:06 UTC
welcome
LukeGoodsell
@LukeGoodsell
Jul 24 2018 12:07 UTC
Is it possible to get the name of a docker image supplied from -with-docker [image] from within the script?
Alexander Peltzer
@apeltzer
Jul 24 2018 12:08 UTC

Process A produces something on STDOUT, I need a value from there (actually an S3 https:// address) - how do I do that?

Process A:


    output:
    stdout into s3_url

Process B:

input:
val sc_stdout from s3_url

script:
    proper_s3_url=$(echo "${sc_stdout}" | grep -e "^https:\/\/\S*")
....
Paolo Di Tommaso
@pditommaso
Jul 24 2018 12:10 UTC
@LukeGoodsell nearly, it's available via the workflow object; it's only meaningful if you use a single container
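(For reference, a minimal sketch of that approach, assuming a single container supplied via -with-docker and that the workflow.container / workflow.containerEngine introspection properties are what is meant here:)

    // anywhere in the pipeline script body
    println "container engine: ${workflow.containerEngine}"
    println "container image:  ${workflow.container}"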
@apeltzer what do you mean, how to read the output from the s3_url channel ?
LukeGoodsell
@LukeGoodsell
Jul 24 2018 12:11 UTC
That works for me, thanks
Alexander Peltzer
@apeltzer
Jul 24 2018 12:16 UTC
@pditommaso the tool i call in Process A produces something like this:
bin/score-client url --object-id <Object ID>
Resolving URL for object: ddcdd044-adda-5f09-8849-27d6038f8ccd (offset = 0, length = -1)
https://s3-external-1.amazonaws.com/45250342340asdlkasdl34lkjq4lkjsdlkjs/ndflsdf9349234
I need to fetch the s3 URL (the 3rd line) and thought about using a regex to extract it from the output
(and use it in Process B to get the data, do something with it and get rid of the data again...)
LukeGoodsell
@LukeGoodsell
Jul 24 2018 12:18 UTC
@apeltzer :
input:
file sc_stdout from s3_url 

script:
    proper_s3_url=$(grep -e "^https:\/\/\S*" "${sc_stdout}")
Paolo Di Tommaso
@pditommaso
Jul 24 2018 12:19 UTC
I would avoid a process just for a grep
you can also do it with a map
s3_url.map { url -> /some regexp here/ }
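A minimal sketch of that map-based approach (assuming s3_url emits the raw multi-line stdout of score-client and that it contains exactly one https URL):

    s3_url
        .map { text ->
            def m = (text =~ /https:\/\/\S+/)      // find the URL in the tool output
            assert m : "no https URL found in: ${text}"
            m[0]                                   // the matched URL string
        }
        .set { s3_url_clean }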
LukeGoodsell
@LukeGoodsell
Jul 24 2018 12:20 UTC
Although you’d also need an assert to ensure a match, right?
Paolo Di Tommaso
@pditommaso
Jul 24 2018 12:21 UTC
well, it depends on the app, not sure what he wants to do
Alexander Peltzer
@apeltzer
Jul 24 2018 12:25 UTC
just wget the found URL and run featureCounts on it after that ;-)
Paolo Di Tommaso
@pditommaso
Jul 24 2018 12:26 UTC
if so yes use a process
Alexander Peltzer
@apeltzer
Jul 24 2018 12:26 UTC
Ok
LukeGoodsell
@LukeGoodsell
Jul 24 2018 14:43 UTC

The docs for the -C option seem to be incorrect. On this page it says
If you want to ignore any default configuration files and use only the custom one use the command line option -C <config file>
However, when I use it, I get:

Unknown option: -C -- Check the available commands and options and syntax with 'help'

The output of nextflow help also lists the -C option.

Paolo Di Tommaso
@pditommaso
Jul 24 2018 14:44 UTC
nextflow -C <config> run .. etc
LukeGoodsell
@LukeGoodsell
Jul 24 2018 14:47 UTC
I see. Any idea how to get that to work when nextflow is called via the shebang line, just like -c does?
Paolo Di Tommaso
@pditommaso
Jul 24 2018 15:13 UTC
um, because the shebang execution only supports the run command
and passes it as nextflow run -c <config>
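In other words (a sketch; main.nf and custom.config are placeholder names):

    # -C replaces every other config file, but it is an option of the top-level
    # nextflow command, so it must come before 'run':
    nextflow -C custom.config run main.nf

    # a script started via its shebang line is re-executed as 'nextflow run <script> ...',
    # so only run's own options, such as -c, are available:
    ./main.nf -c custom.config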
LukeGoodsell
@LukeGoodsell
Jul 24 2018 15:34 UTC
Is it possible to enable docker for some processes but not others? I can see how with labels and the config file’s withLabel option I can choose different containers for different processes, but I’d like to be able to apply a label to processes that should not run in docker.
Paolo Di Tommaso
@pditommaso
Jul 24 2018 15:35 UTC
if the container is not specified for a certain process, it won't use docker even if it's enabled
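A sketch of that pattern in nextflow.config (the label name and image are only examples):

    docker.enabled = true

    process {
        // only processes carrying this label get a container, and hence run in docker
        withLabel: containerised {
            container = 'ubuntu:18.04'
        }
        // processes with no container setting run natively
    }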
Mike Smoot
@mes5k
Jul 24 2018 16:20 UTC
Hi @pditommaso if I want to add a mount point to the job container running in AWS batch, do I need to write my own job definition or is there another way? Just setting docker.runOptions or process.containerOptions doesn't seem to do the trick.
Egon Willighagen
@egonw
Jul 24 2018 16:28 UTC
hi all, I'm looking for an example of two processes written in Groovy, the first splitting a file into lines, and the second running some Groovy code in a single line... something like that... I now got this non-working .nf: https://gist.github.com/egonw/884b30a49bea876969a19d6c105624a1
the main question is how I get output from the first process seen by the second process...
Paolo Di Tommaso
@pditommaso
Jul 24 2018 16:54 UTC
@mes5k Hi Mike, missed your question, nope, you need to use a custom job definition to handle custom mounts
@egonw the first task is not creating any output, therefore it can't pass anything to the second one
moreover when using exec: files have to be resolved against the task.workDir path
Mike Smoot
@mes5k
Jul 24 2018 17:02 UTC

Thanks @pditommaso, that's what I figured. Would you want a patch to add mount points? Something in nextflow.config like:

executor.mounts = [ [name: "efs", hostPath: "/mnt/efs", containerPath: "/mnt/efs"], ...]

And I'd add them like you do for aws-cli?

Paolo Di Tommaso
@pditommaso
Jul 24 2018 17:04 UTC
not sure, in principle there could be different mounts for each container, therefore they should be defined in the executor
Mike Smoot
@mes5k
Jul 24 2018 17:04 UTC

@egonw

Channel.fromPath('simple.csv').splitText().set { masses }

process iterateMasses {
  input:
  val massRange from masses

  output:
  stdout into mch

  script:
  // I'm assuming that you want to do something computationally taxing here ...
  """
  echo "Mass $massRange"
  """
}

mch.view()

If you're not doing anything computationally intensive, then you could just use a map operator:

Channel.fromPath('simple.csv').splitText().map{line -> doStuff(line)}.set { output }

where doStuff is a normal groovy function.
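For completeness, a sketch of that variant, where doStuff stands in for whatever lightweight transformation is needed:

    // an ordinary Groovy function defined in the pipeline script
    def doStuff(String line) {
        line.trim().toUpperCase()          // placeholder transformation
    }

    Channel
        .fromPath('simple.csv')
        .splitText()
        .map { line -> doStuff(line) }
        .view()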

Mike Smoot
@mes5k
Jul 24 2018 17:16 UTC
Right. I was imagining a default for executor and then a directive for a process that could override the default. Maybe a patch isn't worth the effort and a custom job definition is easier? I just worry that as nextflow evolves, my custom job definition will drift from the default.
Paolo Di Tommaso
@pditommaso
Jul 24 2018 17:18 UTC
Well the idea is to not override user job definitions
You can use aws cli to create it programmatically
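A sketch of registering such a job definition with the AWS CLI (all names, paths and the image are placeholders; check the AWS Batch docs for the full container-properties schema):

    aws batch register-job-definition \
        --job-definition-name nf-efs-job \
        --type container \
        --container-properties '{
            "image": "ubuntu:18.04",
            "vcpus": 1,
            "memory": 1024,
            "volumes":     [ { "name": "efs", "host": { "sourcePath": "/mnt/efs" } } ],
            "mountPoints": [ { "sourceVolume": "efs", "containerPath": "/mnt/efs" } ]
        }'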
Mike Smoot
@mes5k
Jul 24 2018 17:20 UTC
Interesting. Do I do that before the pipeline runs, outside of nextflow, or is that something I could do in nextflow?
Paolo Di Tommaso
@pditommaso
Jul 24 2018 17:21 UTC
Before you run the pipeline, as you need to create the compute environment, the queue, etc
Mike Smoot
@mes5k
Jul 24 2018 17:23 UTC
got it. If I provide a custom job def, does nextflow still update individual jobs for queues, cpu, and memory?
Paolo Di Tommaso
@pditommaso
Jul 24 2018 17:23 UTC
Yes
Mike Smoot
@mes5k
Jul 24 2018 17:26 UTC
Ok, maybe I'll just play with job definitions for now. I think that if I can get EFS mounted, then my pipelines should be able to run mostly unmodified in Batch. Right now lots of flatfile databases are assumed to be available in a directory when running.
Stijn van Dongen
@micans
Jul 24 2018 17:36 UTC
(update: I'm trying the groupKey update, but our team is doing day-long sprints at the moment. I don't have it working yet but need more experimenting. I get this error in .nextflow.log: DEBUG nextflow.util.CacheHelper - [WARN] Unknown hashing type: class nextflow.util.GroupKey)
Egon Willighagen
@egonw
Jul 24 2018 17:46 UTC
@mes5k, no, I want to do expensive Groovy calls, one after another... the line splitting is not the point...
so, two processes, both with exec: and Groovy code...
basically, what will happen is:
process 1 will take one input and create thousands of outputs, and for each one of those I want to do another expensive bit of Groovy code
Paolo Di Tommaso
@pditommaso
Jul 24 2018 17:47 UTC
@micans comment in the related issue regarding this, please
Egon Willighagen
@egonw
Jul 24 2018 17:49 UTC
@pditommaso, yes, that's exactly part of my question... how do I get the first Groovy code to create (chunked) output for the second (thousands) of processes
Paolo Di Tommaso
@pditommaso
Jul 24 2018 17:51 UTC
you can use splitters such as splitText or splitCsv
without the need of a process for that
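For example, a sketch assuming a plain CSV input file:

    Channel
        .fromPath('values.csv')
        .splitCsv(header: true)            // one item per row, no process needed
        .set { rows }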
Egon Willighagen
@egonw
Jul 24 2018 17:51 UTC
ok, let me rephrase it...
I start with a list of input values, for each value it calculates an undefined number of output values
for each output value it will again calculate an undefined number of output values
sorry, I tried to keep it a bit more practical, but that does not explain my use case well :/
Paolo Di Tommaso
@pditommaso
Jul 24 2018 17:53 UTC
good, do you already have code for that or do you need to recode it in NF?
Egon Willighagen
@egonw
Jul 24 2018 17:53 UTC
I want to code this in NF...
my above example was a sad attempt...
Paolo Di Tommaso
@pditommaso
Jul 24 2018 17:55 UTC
for each value it calculates
for lightweight computation you can just use map
if it's a time-consuming computation, i.e. you need to parallelise it, use a process with a repeater
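A sketch of the repeater variant in DSL1 (the value list is only an example):

    methods = ['a', 'b', 'c']

    process expensiveStep {
        input:
        each x from methods                // 'each' repeats the task once per value

        output:
        stdout into results

        script:
        """
        echo "heavy computation on $x"
        """
    }

    results.view()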
Egon Willighagen
@egonw
Jul 24 2018 18:00 UTC
yes, but all these examples use command line tools... and/or input from starting variables...
but maybe I must pass everything via files...
Paolo Di Tommaso
@pditommaso
Jul 24 2018 18:02 UTC
NF is a superset of groovy, which is a superset of java
therefore you can use plain java/groovy code if you are happy with that
Egon Willighagen
@egonw
Jul 24 2018 18:03 UTC
but then I don't get the batching and chunking, do I?
Paolo Di Tommaso
@pditommaso
Jul 24 2018 18:04 UTC
what do you want to chunk? a text file?
or a custom format?
Egon Willighagen
@egonw
Jul 24 2018 18:09 UTC
output from the previous (Groovy) step
Mike Smoot
@mes5k
Jul 24 2018 18:10 UTC
I think Egon's question is how to run Groovy within a normal process such that the process outputs its data onto a channel. Since exec runs on the master node, I think the answer is:
process whatever {
   input:
   val x from inch

  output:
  stdout into outch

 script:
 """
 #!/usr/bin/groovy
 // groovy code here that is computationally intensive and prints to STDOUT
 """
}
Egon Willighagen
@egonw
Jul 24 2018 18:12 UTC
@mes5k, ah, ok... so, if I used exec: here, it would not be able to run in parallel...
ok, I'm starting to get it...
@mes5k ok, and I just have that groovy script create multiple whatever_ output files, over which I can then iterate in the next step with something like:
input:
Paolo Di Tommaso
@pditommaso
Jul 24 2018 18:14 UTC
I think Mike is trying to simplify
Egon Willighagen
@egonw
Jul 24 2018 18:14 UTC
file 'whatever_*' into nextProcChunks
Mike Smoot
@mes5k
Jul 24 2018 18:15 UTC
always! :)
Paolo Di Tommaso
@pditommaso
Jul 24 2018 18:15 UTC
using exec: you can parallelise plain groovy code
but again, your code has to take care to save its output in the task.workDir path
script: does it automatically, exec: doesn't (mostly due to a java file api limitation)
Egon Willighagen
@egonw
Jul 24 2018 18:17 UTC
ah, useful info... that explains a lot (and makes a lot of sense, once you know it is like that)
what's the best code then in an exec: to write to a file? like this?:
file = "${task.workDir}/chunk_$something"
file < "some output"
something like that?
Paolo Di Tommaso
@pditommaso
Jul 24 2018 18:21 UTC
if you are fluent with java, task.workDir is a Path
therefore
def file = task.workDir.resolve('foo.txt')
then
file.text = "some output"
or
file << "some output"
tho not so sure regarding the last, it should be allowed by groovy
have a look at file i/o here
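Putting those pieces together, a sketch of an exec: task that writes its own output into the task work directory (names and content are placeholders):

    inputs = Channel.from('a', 'b', 'c')

    process groovyStep {
        input:
        val x from inputs

        output:
        file 'chunk_*' into chunks

        exec:
        // with exec: the code itself must place outputs inside task.workDir
        def out = task.workDir.resolve("chunk_${x}.txt")
        out.text = "some output for ${x}"
    }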
Brad Langhorst
@bwlang
Jul 24 2018 20:48 UTC
is there a dry-run option for nextflow that will print a listing of what will be run?
Mike Smoot
@mes5k
Jul 24 2018 20:55 UTC
No dry-run option. The DAG that nextflow eventually evaluates is dynamically created based on the input and how the data get processed while running, so nextflow can't tell you what it will do a priori.
The -with-dag option will dump the DAG that was generated for a particular run.
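For reference, a sketch of dumping the DAG for a run (the output file name is arbitrary; image formats may require Graphviz to be installed):

    nextflow run main.nf -with-dag flowchart.dot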