These are chat archives for nextflow-io/nextflow

4th
May 2017
Phil Ewels
@ewels
May 04 2017 07:33
Good morning! Another AWS question for you..
In your demo / docs you enter a subnetId - what is this and where does it come from?
I get the imageId and I presume that sharedStorageId is from an EFS that I create in the AWS console beforehand
A quick google suggests that subnetId is something to do with VPCs, that I haven't played with yet and don't really understand
Maxime Garcia
@MaxUlysse
May 04 2017 07:36
Yesterday's conference was inspiring?
Phil Ewels
@ewels
May 04 2017 07:36
Hah, yup! Trying to get the RNA pipeline to run on AWS with the nextflow tools :+1:
Paolo Di Tommaso
@pditommaso
May 04 2017 07:46
That's not strictly mandatory, but most instance types require a subnet ID to be specified
You need to create a VPC, which in turn will create some subnets in your zone
Phil Ewels
@ewels
May 04 2017 07:50
ok cool - just trying to write some docs for our pipeline with this and trying to make it as simple as possible :)
is there an obvious error if one is required and not specified?
any pattern to which instance types need it?
Paolo Di Tommaso
@pditommaso
May 04 2017 07:51
yes, it throws an exception if it's needed and not specified
frankly I don't remember which types, you can find them in the AWS docs
let me know how it proceeds :)
Phil Ewels
@ewels
May 04 2017 07:52
ok, thanks :+1: I'll probably just say to always do it - simpler in the long run
AWS docs get so complicated so fast for a n00b like me.. When you create a subnet, you specify the CIDR block for the subnet, which is a subset of the VPC CIDR block.. whaaat? :P
Sentences here where I don't recognise about 50% of the words. I guess I should probably try to take a step back :)
Apparently I already have all of this stuff created in my account - I guess something somewhere automatically did it all for me
Aha. "Accounts created after 2013-12-04 support EC2-VPC only.". Also apparently you're given a default VPC by default when you play with EC2 and stuff.
Phil Ewels
@ewels
May 04 2017 07:58
Also - if I tell Nextflow to pull input data from s3 and push results to s3, do I need to use an EFS mount?
Presumably everything can just run on the local EBS?
Paolo Di Tommaso
@pditommaso
May 04 2017 09:04
well, S3 is remote storage by definition
to run your task, your data needs to be stored in a POSIX file system, e.g. the local disk or the shared EFS file system
Phil Ewels
@ewels
May 04 2017 09:32
ok - and the advantage of EFS is that it persists if the instance goes down for whatever reason.. anything else? seems like it would be (slightly) cheaper to avoid using EFS?
and can just use bootStorageSize = 'lots'? Or maybe then that instance size is big enough to be expensive, and cheaper / less likely to fail with EFS
hmm, ok. More I think about it, EFS seems safer as can't run out of local space
Paolo Di Tommaso
@pditommaso
May 04 2017 09:37
it should be better when using spot instances, however it would need some serious benchmarking to assess the best cost/performance combination provided by S3 and/or EFS
it could be an interesting topic for September meeting :)
Phil Ewels
@ewels
May 04 2017 09:38
I guess another downside would be that I'd need to manually delete the work directory etc. if using EFS. If it's local storage then it automatically vanishes when the instance is shut down
Paolo Di Tommaso
@pditommaso
May 04 2017 09:40
true, but there are use cases in which it makes sense to persist that result to resume the execution later
so IMO, it depends
wait, the work directory always needs to be a shared EFS or S3 (unless you use a single instance)
so you need to cleanup in both cases
Phil Ewels
@ewels
May 04 2017 09:42
ah ok, good point :+1:
that settles that then
Paolo Di Tommaso
@pditommaso
May 04 2017 09:43
but S3 has an automatic cleanup feature and it's cheaper .. :)
Phil Ewels
@ewels
May 04 2017 09:45
oh that's nice, hadn't thought of that
so could create a personal s3 bucket for all work directories for all pipelines with an object expiration setting on it so that they get automatically deleted when they're a week old or something?
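One way to sketch that idea is an S3 lifecycle rule on the bucket; the rule document below is an illustration only (the `work/` prefix, rule ID and the 7-day window are assumptions, not anything from this conversation):

```json
{
  "Rules": [
    {
      "ID": "expire-nextflow-work",
      "Filter": { "Prefix": "work/" },
      "Status": "Enabled",
      "Expiration": { "Days": 7 }
    }
  ]
}
```

Such a rule can be applied with `aws s3api put-bucket-lifecycle-configuration --bucket <bucket> --lifecycle-configuration file://rule.json`, or the equivalent settings in the S3 console.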
cool :sunglasses:
Paolo Di Tommaso
@pditommaso
May 04 2017 09:46
Yep
Phil Ewels
@ewels
May 04 2017 09:46
I will try that :+1:
The fewer steps to remember the better
Paolo Di Tommaso
@pditommaso
May 04 2017 09:48
exactly !
Maarten van Gompel
@proycon
May 04 2017 10:13
I have a script scripts/foo.pl in my git repository (relative to main.nf) which I want to invoke from within a nextflow process in a script block (rather than using templates); how do I get the right basepath to invoke that script?
Paolo Di Tommaso
@pditommaso
May 04 2017 10:14
I suggest moving it into the bin folder; then it's enough to reference it as foo.pl
if you want to keep it there, you can use $baseDir/scripts/foo.pl
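A minimal sketch of the second option (the process name, channel name and script options are made up for illustration):

```groovy
// Hypothetical DSL1 process invoking a script kept under scripts/ in the repo;
// $baseDir points to the directory containing main.nf
process runFoo {
    input:
    file sample from samples_ch

    script:
    """
    perl $baseDir/scripts/foo.pl --input $sample
    """
}
```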
Maarten van Gompel
@proycon
May 04 2017 10:15
great! thanks!
mitul-patel
@mitul-patel
May 04 2017 11:44
Nextflow is running processes in batches. How do I run the processes one by one, one after another, in the sequential order in which they appear in the nextflow.nf file?
Maarten van Gompel
@proycon
May 04 2017 11:55
just tie the output of one to the input of the other, that determines the order
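For instance, a minimal sketch (process and file names are made up):

```groovy
// Hypothetical two-step pipeline: step2 cannot start before step1,
// because its input channel is step1's output channel
process step1 {
    output:
    file 'a.txt' into step1_out

    script:
    """
    echo hello > a.txt
    """
}

process step2 {
    input:
    file x from step1_out

    script:
    """
    cat $x
    """
}
```

Here step2 only runs once step1 has produced `a.txt`, so the data dependency enforces the order.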
Phil Ewels
@ewels
May 04 2017 12:07
Quick question - can I set Nextflow environment variables inside a nextflow config file, or is this too late?
eg env.NXF_WORK = '/path/to/work/directory'
Paolo Di Tommaso
@pditommaso
May 04 2017 12:14
too late .. :/
Phil Ewels
@ewels
May 04 2017 12:15
ok, I thought that would be the case. Worth asking :)
Paolo Di Tommaso
@pditommaso
May 04 2017 12:15
well done
Phil Ewels
@ewels
May 04 2017 12:15
Any other nice ways to tell a pipeline to use s3 as the work directory?
Paolo Di Tommaso
@pditommaso
May 04 2017 12:16
nextflow run .. -w s3://blah/blah
Phil Ewels
@ewels
May 04 2017 12:16
ok cool, that's nicer :+1: Thanks!
Paolo Di Tommaso
@pditommaso
May 04 2017 12:16
provided you are using the Ignite executor
(that's the default when using AWS)
Phil Ewels
@ewels
May 04 2017 12:17
yup, I've also set executor: 'ignite' in the config file I'm writing for aws
Paolo Di Tommaso
@pditommaso
May 04 2017 12:17
ok
Phil Ewels
@ewels
May 04 2017 12:18
just to be on the safe side :)
Hmm, just realised that I can't set all of the cloud defaults in the pipeline config file, as nextflow cloud create my-cluster -c 10 doesn't know about which pipeline I'm going to run
Paolo Di Tommaso
@pditommaso
May 04 2017 12:21
well, those configs should go in the pipeline config file
Phil Ewels
@ewels
May 04 2017 12:21
ok, so that can be done? How does nextflow know which pipeline to read the config file from?
eg. I currently have this:
cloud {
  imageId = 'ami-43f49030'
  instanceType = 'm4.large'
  // subnetId = 'subnet-05222a43'
  spotPrice = 1
  autoscale {
    enabled = true
    minInstances = 1
    maxInstances = 10
    instanceType = 'm4.2xlarge'
    spotPrice = 1
    terminateWhenIdle = true
  }
}
Paolo Di Tommaso
@pditommaso
May 04 2017 12:23
you need that only to setup the cluster
then you can have a cloud config profile in the pipeline repo
once you have sshed into the master node
you can run the pipeline in the usual way specifying the cloud profile/config
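A hedged sketch of such a profile in the pipeline's nextflow.config (the profile name, bucket and settings are assumptions for illustration):

```groovy
// Hypothetical cloud profile in the pipeline repo's nextflow.config
profiles {
    aws {
        process.executor = 'ignite'          // default executor on the cloud cluster
        workDir = 's3://my-bucket/work'      // assumed bucket for the work directory
    }
}
```

On the master node something like `nextflow run <pipeline> -profile aws` would then pick it up.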
Phil Ewels
@ewels
May 04 2017 12:24
ok - so I need to tell people to create their own nextflow.config file by copying and pasting the above before they create the cluster?
then everything else should work as expected, I get that
it's just the cluster initialisation config stuff that I'm thinking about at the moment
this is fine actually - I'm already telling people to save the subnetId in their ~/.nextflow/config file, so I can just tell them to add it all
sorry for all of the questions :laughing:
Paolo Di Tommaso
@pditommaso
May 04 2017 12:32
yes, something like this
you need to provide that config as a template users can copy and paste to launch the cloud
we can brainstorm to find a better workflow eventually, suggestions are welcome
Phil Ewels
@ewels
May 04 2017 12:35
sounds good :+1: Just working through it all myself and writing docs as I go. Will send you the link for comments / edits once I'm done :)
Paolo Di Tommaso
@pditommaso
May 04 2017 12:35
well done
Shellfishgene
@Shellfishgene
May 04 2017 12:47
@pditommaso Still working on adding our queuing system. It does not support -wd for the working dir as SGE does; is it correct then to use the getHeaders method to add "cd ${quote(task.workDir)}\n" as the PBS executor does?
Paolo Di Tommaso
@pditommaso
May 04 2017 12:47
exactly
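A rough sketch of that pattern, based only on what's described above (the base class, method signatures and quote helper are assumptions to verify against the Nextflow source):

```groovy
// Hypothetical custom grid executor mirroring the PBS executor trick of
// prepending a 'cd' to the job headers when the scheduler has no -wd flag.
class MyQueueExecutor extends AbstractGridExecutor {

    @Override
    protected String getHeaders(TaskRun task) {
        def result = new StringBuilder()
        // change into the task work dir first, since the scheduler won't do it
        result << "cd ${quote(task.workDir)}\n"
        result << super.getHeaders(task)
        return result.toString()
    }
}
```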
Phil Ewels
@ewels
May 04 2017 12:49
ssh -i /Users/philewels/.ssh/id_rsa philewels@ec2-52-214-169-188.eu-west-1.compute.amazonaws.com
ssh: connect to host ec2-52-214-169-188.eu-west-1.compute.amazonaws.com port 22: Operation timed out
:(
now, where to start..
Paolo Di Tommaso
@pditommaso
May 04 2017 12:50
:D
the instance is running ?
Phil Ewels
@ewels
May 04 2017 12:50
yup
Paolo Di Tommaso
@pditommaso
May 04 2017 12:50
let me check
Phil Ewels
@ewels
May 04 2017 12:51
Could be that I missed something with the subnet? I just copied one of the subnet addresses that I already had, pretty much at random
Paolo Di Tommaso
@pditommaso
May 04 2017 12:52
have you opened the port 22 in the security group ?
follow this guide
most common problem is the missing port in the security group and missing public IP address
Phil Ewels
@ewels
May 04 2017 12:54
blob
Paolo Di Tommaso
@pditommaso
May 04 2017 12:54
not enough, that's all in the same source
Phil Ewels
@ewels
May 04 2017 12:56
gotcha
Paolo Di Tommaso
@pditommaso
May 04 2017 12:56
found ?
Phil Ewels
@ewels
May 04 2017 12:56
ok, need to destroy and recreate I guess :)
Paolo Di Tommaso
@pditommaso
May 04 2017 12:56
no!
you can change it on fly
Phil Ewels
@ewels
May 04 2017 12:56
I'm in! :D
Paolo Di Tommaso
@pditommaso
May 04 2017 12:56
click on the instance security group
then click on the inbound rules
Phil Ewels
@ewels
May 04 2017 12:57
blob
allow anyone, anywhere, to access anything, anywhere
seems legit :)
Paolo Di Tommaso
@pditommaso
May 04 2017 12:57
:)
better ?
Phil Ewels
@ewels
May 04 2017 13:02
Ok, I think I have an analysis running! :tada:
and there are files appearing in my s3 bucket :tada:
Paolo Di Tommaso
@pditommaso
May 04 2017 13:02
cool !
Phil Ewels
@ewels
May 04 2017 13:03
hah, I get so excited when stuff on aws works, it's like magic..
is there some way that I can see how many worker nodes there are?
I can only see the head node on the ec2 console I think..
Paolo Di Tommaso
@pditommaso
May 04 2017 13:04
well, on my side there are 5 years of work, on AWS some centuries ;)
yes, aws console
Phil Ewels
@ewels
May 04 2017 13:05
haha, and we salute you for your service! :bow:
Paolo Di Tommaso
@pditommaso
May 04 2017 13:05
happy about that :+1:
Phil Ewels
@ewels
May 04 2017 13:06
hmm, maybe I shouldn't get too excited
Paolo Di Tommaso
@pditommaso
May 04 2017 13:06
eheh
Phil Ewels
@ewels
May 04 2017 13:06
work directories all empty still
Toni Hermoso Pulido
@toniher
May 04 2017 13:06
Hello, I didn't find in doc. Is it possible to change log .nextflow.log filename ?
Phil Ewels
@ewels
May 04 2017 13:07

yes, aws console

You mean the main EC2 console dashboard?

Paolo Di Tommaso
@pditommaso
May 04 2017 13:07
yes
look at the cpu activity on there to see if something is running
hey toni
yes
Phil Ewels
@ewels
May 04 2017 13:08
blob
Looks good. Could be downloading the index files I guess
Paolo Di Tommaso
@pditommaso
May 04 2017 13:08
nextflow -log <something> run <script>
Toni Hermoso Pulido
@toniher
May 04 2017 13:08
Oh, thanks @pditommaso
Paolo Di Tommaso
@pditommaso
May 04 2017 13:08
and the docker images
@toniher :+1:
Phil Ewels
@ewels
May 04 2017 13:09
blob
aha! workers are now initialising.. cooool
Paolo Di Tommaso
@pditommaso
May 04 2017 13:10
ahahha
Phil Ewels
@ewels
May 04 2017 13:15
ok fantastic, everything seems to be working. Super nice!
Paolo Di Tommaso
@pditommaso
May 04 2017 13:16
first try, not so bad !
Phil Ewels
@ewels
May 04 2017 13:16
yeah, pretty happy with that :)
Getting loads of warnings that I'm asking for more cpus etc than is available. Is this a problem?
Will NF just use whatever is available and continue on?
Phil Ewels
@ewels
May 04 2017 13:23
hmm, maybe not..
Paolo Di Tommaso
@pditommaso
May 04 2017 13:25
nope
do you have processes requesting more cpus than available in the instance ?
Phil Ewels
@ewels
May 04 2017 13:27
yup
now trying to think of a nicer way to configure this in the pipeline
requirements can vary massively according to reference genome
Paolo Di Tommaso
@pditommaso
May 04 2017 13:28
true, we need to improve this on NF side as well
Phil Ewels
@ewels
May 04 2017 13:29
I think I'll just set this in the user config file and leave it blank in my aws config
Phil Ewels
@ewels
May 04 2017 13:35
hmm, -resume doesn't seem to work with the work directory on s3 for some reason
oh, could be that I messed it up actually
Paolo Di Tommaso
@pditommaso
May 04 2017 13:35
umm, it should
Phil Ewels
@ewels
May 04 2017 13:36
I reran it without -resume by accident, cancelled and then reran again with -resume - I guess that stopped it from working?
Paolo Di Tommaso
@pditommaso
May 04 2017 13:36
in this case you need to specify the session id/name to resume
run
nextflow log
Phil Ewels
@ewels
May 04 2017 13:37
yup :+1: Was only FastQC so not too fussed
Paolo Di Tommaso
@pditommaso
May 04 2017 13:37
you can find it there
then specify it as resume argument
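So the recovery steps would look roughly like this (main.nf is a placeholder for the actual entry script):

```
nextflow log                               # lists previous runs with their session IDs and names
nextflow run main.nf -resume <session-id>
```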
Phil Ewels
@ewels
May 04 2017 13:37
sorry, jumped to "it must not work" too quickly :flushed:
Paolo Di Tommaso
@pditommaso
May 04 2017 13:38
:D
Phil Ewels
@ewels
May 04 2017 13:42
hmm, still getting the same error:
WARN: ### Task (id=6) requests an amount of resources not available in any node in the current cluster topology -- CPUs: 8; memory: 32 GB
using m4.2xlarge for worker nodes
(m4.large for master)
Paolo Di Tommaso
@pditommaso
May 04 2017 13:43
um, and the other nodes ?
Phil Ewels
@ewels
May 04 2017 13:44
using m4.2xlarge for worker nodes
Paolo Di Tommaso
@pditommaso
May 04 2017 13:44
ops
so the same
Phil Ewels
@ewels
May 04 2017 13:45
hah, sorry - copy and paste error.. m4.large for master
do they have some overhead that means I shouldn't request their total capacity?
Paolo Di Tommaso
@pditommaso
May 04 2017 13:50
I've seen instances with nominal memory slightly different from the real one
can you try to log in to a worker node and check how much memory is reported ?
Phil Ewels
@ewels
May 04 2017 13:51
I just set to 7 cpus and 30.GB memory (instead of 8 / 32) and it seems to be working now :+1:
Paolo Di Tommaso
@pditommaso
May 04 2017 13:52
quite surely the problem is the mem
Phil Ewels
@ewels
May 04 2017 13:53
MemTotal:        8178648 kB
MemFree:         6674420 kB
MemAvailable:    7247904 kB
(second ssh session, so stuff running on this node currently)
wait, that doesn't look right
Paolo Di Tommaso
@pditommaso
May 04 2017 13:53
indeed
Phil Ewels
@ewels
May 04 2017 13:54
what's a good way to check how much memory is reported?
everything I'm trying is giving the same numbers. eg. top: Mem: 8178648k total, 840968k used, 7337680k free, 26324k buffers
Paolo Di Tommaso
@pditommaso
May 04 2017 13:56
cat /proc/meminfo
Phil Ewels
@ewels
May 04 2017 13:56
yeah, that's what I did to get the first output
Paolo Di Tommaso
@pditommaso
May 04 2017 13:56
then there's something I'm not understanding
Paolo Di Tommaso
@pditommaso
May 04 2017 13:56
that would be 8 GB ?
what does it return curl http://169.254.169.254/latest/meta-data/instance-type?
Phil Ewels
@ewels
May 04 2017 13:59
m4.large
Paolo Di Tommaso
@pditommaso
May 04 2017 13:59
ah
so the mem is fine but not the instance type
that's the master or a worker ?
Phil Ewels
@ewels
May 04 2017 14:00
I thought it was a worker
I ssh into the master with the command from nextflow
then I copied the ip address of a worker from the console and logged into that: ssh ec2-34-251-224-65.eu-west-1.compute.amazonaws.com
Paolo Di Tommaso
@pditommaso
May 04 2017 14:00
need to leave now
Phil Ewels
@ewels
May 04 2017 14:01
ahhhaa
Permission denied
Didn't notice that, sorry :flushed:
so stayed on master
my bad

need to leave now

no problem - I'll have another go at getting the memory stats again and post here. Many many thanks for all of your help today!

(and sorry everyone else here for spamming)
Michael L Heuer
@heuermh
May 04 2017 14:13
@ewels No need to apologize! We're all eagerly awaiting the doc when you figure this out. :)
(and when you figure it out, see if I can convince you to swap out parts of your workflow for things running on Spark, but that's a conversation for a different gitter room)
Phil Ewels
@ewels
May 04 2017 14:15
hah, sounds good ;) I'm all ears!
ok, will leave this until tomorrow now but this is what I've written so far: https://github.com/ewels/NGI-RNAseq/blob/master/docs/amazon_web_services.md#3-nextflow-integration---elastic-clusters
PR open here if anyone has any comments: SciLifeLab/NGI-RNAseq#123
Maxime Garcia
@MaxUlysse
May 04 2017 14:29
:+1:
mitul-patel
@mitul-patel
May 04 2017 14:31
is it possible to use a different output channel each time the process runs? here is the code:

def pipeline (j) {
    process index {

        executor 'local'
        cpus 4
        memory '12 GB'
        tag { "Genome: $genome_base" }

        publishDir "$outDir", mode:'copy', overwrite: true

        input:
        file fasta from genome

        output:
        file "$outDir/iteration${j}/${genome_base}" into "INDEX${j}"

        script:
        """
        Refindex.py --itr ${j} --ref ${fasta} --out ${outDir}
        """
    }
}

def itr = 1
while (itr <= iterations) {
    status_index = pipeline(itr)
    itr++
}

Phil Ewels
@ewels
May 04 2017 14:33
@mitul-patel - probably not, why do you want to do this?
mitul-patel
@mitul-patel
May 04 2017 14:36
I am iterating process 10 times for different samples. Means for each sample process will run 10 times.
Phil Ewels
@ewels
May 04 2017 14:38
I have a feeling that one of the examples does this: https://github.com/nextflow-io/examples
I may be wrong..
But anyway - I guess that the best way to do this would be to use a variable to hold the count variable, as in https://github.com/nextflow-io/examples/blob/master/set_in_out.nf
eg:
output:
set $j, file ("$outDir/iteration${j}/${genome_base}") into outputchannel
then you have one output channel with 10 outputs from each sample
Phil Ewels
@ewels
May 04 2017 14:43
would that work?
mitul-patel
@mitul-patel
May 04 2017 14:45
let me give a try.....
mitul-patel
@mitul-patel
May 04 2017 14:55
no it didn't work... ERROR ~ Channel outputchannel has been used twice as an output by process index and process index
Phil Ewels
@ewels
May 04 2017 14:59
sounds like you've copied and pasted the same process twice maybe?
mitul-patel
@mitul-patel
May 04 2017 15:08
no I didn't... I am working on a pipeline which has 4 processes. Each process has to run 10 times, but in a sequential manner: for iteration 1, run process1, then process2, then process3 and process4; for iteration 2, run process1, then process2, then process3 and process4; and so on......
I am trying a while loop to iterate over a def which has the 4 processes in it...
if I don't specify an output channel it will work... but the processes will run randomly instead of in order.....
Phil Ewels
@ewels
May 04 2017 15:11
ah ok, that could do it. I think you want to use an input channel counter instead. Trying to find an example for you now...
I knew I'd read an example like this somewhere :)
so if you add an each channel into the first process, it will create 10 iterations. Then these can be passed on to the other channels as normal
or you can add an each to each process, but then you'll get 10*10*10*10 outputs (which may be what you want?)
Does that make sense?
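A hedged DSL1 sketch of the each approach (the process, file and channel names are loosely borrowed from the snippet further up and are assumptions):

```groovy
// 'each' repeats the task once per value, so this process runs 10 times
// (j = 1..10) against the same reference file; names are illustrative only
process index {
    input:
    each j from 1..10
    file fasta from genome

    output:
    set val(j), file("iteration_${j}") into index_out

    script:
    """
    Refindex.py --itr ${j} --ref ${fasta} --out iteration_${j}
    """
}
```

Downstream processes can then read from index_out as usual; note the 10 iterations are scheduled concurrently, not strictly in numeric order.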
mitul-patel
@mitul-patel
May 04 2017 15:25
I tried each before but it didn't solve my problem. Process1 output will be input to process2. Process2 output will be input to process3. Process3 output will be input to process4. And process4 output will be input to process1. So they have to run in order 10 times...
Phil Ewels
@ewels
May 04 2017 15:26
yup, that should work fine I think?
just use the named output channel for p1 as the input channel for p2, and so on
p2 will start running as soon as there are any completed outputs from p1
Shellfishgene
@Shellfishgene
May 04 2017 15:40
I'm confused: when running make compile, make pack and make install, the result is the "nextflow" bash script. I can copy that to a different server and it "works"; where's the rest of the program?
Shellfishgene
@Shellfishgene
May 04 2017 15:51
Figured it out, kinda...
mitul-patel
@mitul-patel
May 04 2017 15:53
thanks Phil... I have now another question. When I use 'each' it doesn't start with 1, it starts with 5. I need to iterate each process from 1..10, because each iteration has a specific order...
Paolo Di Tommaso
@pditommaso
May 04 2017 16:23
@heuermh is doing Spark scouting in the NF chat :)
Phil Ewels
@ewels
May 04 2017 16:29
@mitul-patel - that's a tougher one! They'll all go off at once, so the order will be semi-random due to queueing
Maybe do a bash loop inside the process if you want them to run one at a time?
Or maybe @pditommaso has better suggestions?
Shellfishgene
@Shellfishgene
May 04 2017 16:34
@pditommaso How often does nextflow check on running jobs with qstat?
Félix C. Morency
@fmorency
May 04 2017 17:42
nextflow pull now gives me Authentication is required but no CredentialsProvider has been registered does this ring a bell to anyone?
it only does that when I give -r
Félix C. Morency
@fmorency
May 04 2017 17:50
NF doesn't seem to like branch names personal/user/someFeature
it works if i give -r SHA
Paolo Di Tommaso
@pditommaso
May 04 2017 19:06
@Shellfishgene every minute
@fmorency Sounds like a bug. Branch names can contain / ?
Paolo Di Tommaso
@pditommaso
May 04 2017 19:13
@mitul-patel you cannot iterate over a process; you need to use a streaming/functional approach to repeat the execution of a task
Félix C. Morency
@fmorency
May 04 2017 19:28
@pditommaso Yeah it seems like a bug with branch names containing /
Paolo Di Tommaso
@pditommaso
May 04 2017 19:38
Could you please open an issue providing more details on how to replicate the issue eg. provider (GitHub, BitBucket, etc), public or private host, etc.
Félix C. Morency
@fmorency
May 04 2017 19:39
Sure
Paolo Di Tommaso
@pditommaso
May 04 2017 19:39
Tx!