These are chat archives for nextflow-io/nextflow

19th
Oct 2016
Félix C. Morency
@fmorency
Oct 19 2016 13:40
morning.
@pditommaso do you ever sleep?
Evan Floden
@evanfloden
Oct 19 2016 13:41
@fmorency: We keep him locked in lab with the lights on
Paolo Di Tommaso
@pditommaso
Oct 19 2016 13:48
something like this
:D
Evan Floden
@evanfloden
Oct 19 2016 13:56

The above configuration enables the autoscaling features so that the cluster will include at least 5 nodes. If at any point one or more tasks spend more than 5 minutes without being processed, the number of instances needed to fulfill the pending tasks, up to the limit specified by the maxInstances attribute, are launched.

Where is the 5 mins specified?

Paolo Di Tommaso
@pditommaso
Oct 19 2016 13:56
it's the default
Evan Floden
@evanfloden
Oct 19 2016 13:59
Cool, so if we start with 10 processes pending (and each takes, say, 20 min to run), 5 will run at the beginning, then after 5 min, the other 5 will run?
Paolo Di Tommaso
@pditommaso
Oct 19 2016 14:00
yep
Evan Floden
@evanfloden
Oct 19 2016 14:00
What happens when the number of tasks drops below the minInstances?
Paolo Di Tommaso
@pditommaso
Oct 19 2016 14:01
they remain idle
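
For reference, a minimal sketch of the configuration being discussed, assuming the cloud.autoscale scope of nextflow.config. minInstances = 5 matches the quoted documentation; the maxInstances value is illustrative, and starvingTimeout is an assumption about the name of the setting behind the 5 minute default, so check it against the docs for your version.

```groovy
// nextflow.config (sketch; values are illustrative)
cloud {
    autoscale {
        enabled = true
        minInstances = 5            // nodes kept running even when idle (and billed)
        maxInstances = 10           // illustrative upper bound when scaling out
        starvingTimeout = '5 min'   // assumed setting name; 5 min is the default mentioned above
    }
}
```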
Félix C. Morency
@fmorency
Oct 19 2016 14:03
following yesterday's fromFilePairs() example, can I split the resulting array outside the process?
something like set val(sid_for_test1), file(other), file(some) from test1_ch. because I have multiple "first tasks" and I don't use the full input files in any of them
Paolo Di Tommaso
@pditommaso
Oct 19 2016 14:05
yes, you can combine channel output as you need
Evan Floden
@evanfloden
Oct 19 2016 14:07
okay, and I would pay for the idle instances? So usually we would want a low minInstances of, say, 1 (if most pipelines have a process made of 1 task) but with a short increaseInstanceTime so instances are not idle. Makes sense :+1:
Paolo Di Tommaso
@pditommaso
Oct 19 2016 14:08
of course you pay, for this reason the elastic scheduling helps you to save money!
(need to leave now)
Félix C. Morency
@fmorency
Oct 19 2016 14:33
works with separate()
Félix C. Morency
@fmorency
Oct 19 2016 14:39
it's unfortunate to have to split my experiment_id channel in {# of process} channels
Félix C. Morency
@fmorency
Oct 19 2016 16:00
is there a clean way to prepend the experiment_id to every output files?
Mike Smoot
@mes5k
Oct 19 2016 16:01

@pditommaso Hi Paolo, is there any support for treating S3 buckets as directories? I have this code:

Channel
    .fromPath( reads_dir + "/*.fastq.gz" )
    .into{ read_pair_lanes }

and if "reads_dir" is an S3 bucket, I get the following error:

ERROR ~ Relative path cannot be made absolute: sgi-pipeline-dev/nextflow-demo-data

I'm not sure if this is because my S3 bucket has a "subdirectory" or whether Channel.fromPath just doesn't support S3 buckets the way it does directories. Any advice would be appreciated!

Félix C. Morency
@fmorency
Oct 19 2016 16:15
can you limit the number of CPUs taken by the local executor? (i.e. run one task at a time even if they could run in parallel)
Mike Smoot
@mes5k
Oct 19 2016 16:21

@fmorency yes, in nextflow.config :

executor {
  $local {
      queueSize = num_jobs_in_parallel_you_want
  }
}

You can also control things on a process by process basis using directives: https://www.nextflow.io/docs/latest/process.html#directives
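
As a sketch of the per-process route, assuming the maxForks directive is what you want here: it caps how many instances of a single process run at once. The process name and the command are hypothetical.

```groovy
// main.nf (sketch): run at most one instance of this process at a time
process oneAtATime {    // hypothetical process name
    maxForks 1          // never run two tasks of this process in parallel

    script:
    """
    echo running alone
    """
}
```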

Félix C. Morency
@fmorency
Oct 19 2016 16:22
thanks
amacbride
@amacbride
Oct 19 2016 16:37
If I have a channel output that is a delimited string (say, "some:composite"), is there a way to have a subsequent reader of that channel operate on a subcomponent of that string? In Groovy, it would effectively be
cs.tokenize(":")[0]
(if cs is the composite string)
I'm trying to figure out if it's possible to do it as part of a channel input using filters.
Mike Smoot
@mes5k
Oct 19 2016 16:42
@amacbride Do you want all elements of your composite string like this: Channel.from("some:new:thing").flatMap{ it.tokenize(':') }.view() or do you just want to pluck one?
If just one, then use map instead of flatMap and pluck out what you need from the list.
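
The difference between the two can be sketched in plain Groovy, without Nextflow: collectMany stands in for flatMap and collect stands in for map. This is only an analogy on lists, not the channel API itself, and the second input string is a made-up example.

```groovy
// Plain Groovy stand-in for the channel operators:
// collectMany plays the role of flatMap, collect plays the role of map.
def items = ["some:new:thing", "other:composite"]   // second item is hypothetical

// flatMap-like: every token becomes its own element
def allTokens = items.collectMany { it.tokenize(':') }

// map-like: one element out per element in, plucking the first token
def firstTokens = items.collect { it.tokenize(':')[0] }

println allTokens
println firstTokens
```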
amacbride
@amacbride
Oct 19 2016 16:44
@mes5k I'm trying to do it inline in the input declaration. So, for example, something like:
input:
val str1 from [...].tokenize(":")[0]
I can push this down into the actual script body, but it's messier, so I was wondering if there was a way to do an inline transformation of a channel value in the input declaration.

It looks like input declarations aren't in the same scope, as I tried

val str from cs
val str2 from str.tokenize(":")[0]

but str isn't visible to the second declaration. I didn't think it would work, but thought it was worth a try :)

Mike Smoot
@mes5k
Oct 19 2016 16:49
I think at the input declaration [...] is still a channel, so I'd guess any channel operator would work. So something like [...].map{ it.tokenize(":")[0] } might do the trick
amacbride
@amacbride
Oct 19 2016 16:49
Ooh, I'll give that a try. Thanks!
Perfect!
cs = Channel.from("some:composite")

process sayHello {
    echo true

    input:
    val s2 from cs.map { it.tokenize(":")[1] }

    script:
    "echo sayHello: ${s2}"
}
Mike Smoot
@mes5k
Oct 19 2016 16:55
:+1:
Paolo Di Tommaso
@pditommaso
Oct 19 2016 17:10
great community :)
any unanswered question ?
Mike Smoot
@mes5k
Oct 19 2016 17:11
mine! :)
Paolo Di Tommaso
@pditommaso
Oct 19 2016 17:11
:)
is there any support for treating S3 buckets as directories?
Mike Smoot
@mes5k
Oct 19 2016 17:12
right
Paolo Di Tommaso
@pditommaso
Oct 19 2016 17:12
what do you mean ?
Mike Smoot
@mes5k
Oct 19 2016 17:13

Should this work?

Channel
    .fromPath( reads_dir + "/*.fastq.gz" )
    .into{ read_pair_lanes }

if reads_dir is an S3 bucket?

Paolo Di Tommaso
@pditommaso
Oct 19 2016 17:13
yes
Mike Smoot
@mes5k
Oct 19 2016 17:13
Ok, awesome - it could be how our VPC is configured.
Paolo Di Tommaso
@pditommaso
Oct 19 2016 17:14
is the aws s3 client working?
Mike Smoot
@mes5k
Oct 19 2016 17:14
That's what I'll check next, as soon as I'm out of the meeting I'm in
Paolo Di Tommaso
@pditommaso
Oct 19 2016 17:14
:+1:
Félix C. Morency
@fmorency
Oct 19 2016 17:44
well.... i fitted a full dMRI pipeline in 520 lines of NF + 100 lines configs/path.
Paolo Di Tommaso
@pditommaso
Oct 19 2016 17:45
how big was the previous implementation ?
Félix C. Morency
@fmorency
Oct 19 2016 17:46
i don't want to say :P
Paolo Di Tommaso
@pditommaso
Oct 19 2016 17:46
make me happy, please :D
amacbride
@amacbride
Oct 19 2016 17:47
@pditommaso Yes, I've really enjoyed working with NF, and both you and the community that has grown up around it have been very helpful.
Paolo Di Tommaso
@pditommaso
Oct 19 2016 17:47
that's great
Félix C. Morency
@fmorency
Oct 19 2016 17:47
our old tissue segmentation task had 91 lines. the new implementation has 25
and we don't have to deal with graph, dependencies, etc... essentially what NF does.
Paolo Di Tommaso
@pditommaso
Oct 19 2016 17:50
exactly !
just focus on the logic of your workflow
Félix C. Morency
@fmorency
Oct 19 2016 17:53
I could eliminate another 100 lines in the NF pipeline easily
the only thing i "don't like" is having to split my id channel into every process
it's cumbersome and it makes the dag (.png) ugly
Evan Floden
@evanfloden
Oct 19 2016 17:57
Is this because the id channel contains a value, and that value is used in several independent inputs? I.e. the same starting data enters the pipeline several times?
If so, you could make a small process for generating those channels
Félix C. Morency
@fmorency
Oct 19 2016 17:58
@skptic yes exactly. it's the "experiment id" channel which is the same across the workflow for one set of input files
i have something like id.into{id_for_process1; id_for_process2; ...; id_for_processN}
and in each process I have val id from id_for_processN
Evan Floden
@evanfloden
Oct 19 2016 18:01
I would create a small process, have as input val(exp_id) and each y from (1..N)
By having the output as exp_id, the each y in [0..N] effectively multiplies the channel
I’m happy to use nextflow console to create a small example if you need
Félix C. Morency
@fmorency
Oct 19 2016 18:03
i would be curious to see your solution in a small example
Evan Floden
@evanfloden
Oct 19 2016 18:04
Or even better use .cross()
I’ll cook something up, 2 min
Evan Floden
@evanfloden
Oct 19 2016 18:12
Do you only use the same id on each run?
Félix C. Morency
@fmorency
Oct 19 2016 18:12
the id is deduced from the input folder
or from the command-line
Evan Floden
@evanfloden
Oct 19 2016 18:52
Sorry, was dinner time, plus champions league set up required
expID='myID'
processNum=10

Channel
    .from(1..processNum)
    .map { expID }    // replace each number with expID
    .set{exp_ids}

exp_ids
    .view()
This creates a channel called exp_ids which emits myID 10 times
Félix C. Morency
@fmorency
Oct 19 2016 18:54
i see. thanks
can we create variables from input: content to be used in output: and script:?
Evan Floden
@evanfloden
Oct 19 2016 19:00
If expID is the same within a single execution, just set it as a value
expID = 'myID'
list = [0, 1, 2, 3, 4, 5]

process A {

    input:
    val expID    // shorthand for: val expID from expID

    script:
    """
    echo $expID
    """

}

process B {

    input:
    val expID
    each num from list    // one task per element of list

    script:
    """
    echo $expID $num
    """

}
Sorry, major keyboard issues since upgrading to Sierra
Evan Floden
@evanfloden
Oct 19 2016 19:05
Process A is pretty standard, and you can use the val in many processes. With process B, you get 6 tasks executed, one per element of list.
amacbride
@amacbride
Oct 19 2016 22:43
I had a similar question. In our pipeline, we start with a directory full of FASTQ files, which gets turned into a fastq channel passed into the alignment step. (Then about 30-40 additional steps in a large, multi-branched tree.) Currently, each step takes a sample_name as part of one of the input channels and uses it as a tag value, then passes it as output on the output channels, but I wasn't sure if there was a better/more elegant way to do it.
process alignment {
    tag { sample_name + ":" + lane_id }

    input:
        file reference
        set sample_name, sample_id, lane_id, file(read1), file(read2) from fastqs
    output:
        set sample_name, file ("${lane_id}.Aligned.sortedByCoord.out.bam") into alignments
        set sample_name, file ("${lane_id}.Chimeric.out.sam") into fusion_sams
        set sample_name, file ("${lane_id}.Chimeric.out.junction") into fusion_jxns
        set sample_name, file ("${lane_id}.Log.final.out") into star_logs
    [...]
}

process merge {
    tag { sample_name }

    input:
        set sample_name, file('bam') from alignments.groupTuple(sort: true)

    output:
        set sample_name, file('merged.bam') into merged1, merged2, merged3

    [...]
}

etc.
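
What carrying sample_name through buys you at the merge step can be sketched in plain Groovy: grouping by the first element of each tuple is roughly what groupTuple does to the alignments channel. This is a list-based analogy only, not Nextflow code, and the sample and file names are hypothetical.

```groovy
// Plain Groovy analogy for alignments.groupTuple(sort: true):
// collect every per-lane BAM under its sample_name key.
def alignments = [
    ['sampleA', 'L001.bam'],   // hypothetical [sample_name, bam] tuples
    ['sampleB', 'L001.bam'],
    ['sampleA', 'L002.bam'],
]

def grouped = alignments
    .groupBy { it[0] }                                            // key on sample_name
    .collect { name, rows -> [name, rows.collect { it[1] }.sort()] }

grouped.each { println it }
```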
Paolo Di Tommaso
@pditommaso
Oct 19 2016 22:52
:sleeping:
Mike Smoot
@mes5k
Oct 19 2016 22:56
@amacbride Seems fine to me. I'm trying to imagine what would be more elegant...
amacbride
@amacbride
Oct 19 2016 23:30
@mes5k I wasn't sure either, so I figured I'd ask. I will check it off my vast "list of things to worry about unnecessarily" :smile: