These are chat archives for nextflow-io/nextflow

3rd
Aug 2017
Maxime Garcia
@MaxUlysse
Aug 03 2017 07:44
I got one thanks to @ewels
Paolo Di Tommaso
@pditommaso
Aug 03 2017 07:46
ahaha
:+1:
Maxime Garcia
@MaxUlysse
Aug 03 2017 07:49
I now have a lot of stickers on my laptop
and most of them are from him
Paolo Di Tommaso
@pditommaso
Aug 03 2017 07:49
Do you have CAW stickers?
Maxime Garcia
@MaxUlysse
Aug 03 2017 07:50
We just received them like yesterday B-)
I'll give some to him so he'll pass them along
Paolo Di Tommaso
@pditommaso
Aug 03 2017 07:50
you are coming to the NF workshop in september, right?
Maxime Garcia
@MaxUlysse
Aug 03 2017 07:51
I can't: there's a biobank week here in Stockholm then, and since I'm working for a biobank I have to be there, but @ewels will of course be there
Paolo Di Tommaso
@pditommaso
Aug 03 2017 07:53
I see, ok, at least give CAW stickers to Phil, so he will bring them to us :)
Phil Ewels
@ewels
Aug 03 2017 09:28
I'll come stocked up and ready 👍🏻
Ghost
@ghost~598345d2d73408ce4f6ff925
Aug 03 2017 15:49
Where is the best place to ask a question regarding Nextflow?
Paolo Di Tommaso
@pditommaso
Aug 03 2017 15:49
unless you want to come to Barcelona, I guess here :)
Ghost
@ghost~598345d2d73408ce4f6ff925
Aug 03 2017 15:51
Lol awesome. I want to know if you can modularize a pipeline script. I have an RNA-seq pipeline written in Nextflow but it's long (500 lines). What I'd like to do is put the processes in a separate file and then import them into a main.nf script for example
Sort of like importing a function from a local Python file
Is this possible?
Paolo Di Tommaso
@pditommaso
Aug 03 2017 15:53
you can have an NF process just executing a Python script, sure
Ghost
@ghost~598345d2d73408ce4f6ff925
Aug 03 2017 15:58
Ah thanks! So it seems you can direct pipelines to execute python scripts but can you import an entire nextflow process into another nextflow file?
Paolo Di Tommaso
@pditommaso
Aug 03 2017 16:00
you could do
Ghost
@ghost~598345d2d73408ce4f6ff925
Aug 03 2017 16:00
For example, write "process fastqc { input:, output:, script: }" in one file, and import that entire process into another file and just say... run fastqc
Paolo Di Tommaso
@pditommaso
Aug 03 2017 16:00
process runPipeline {
   // placeholder name; the script block just launches another Nextflow pipeline
   """
   nextflow run .. etc
   """
}
but I wouldn't suggest that
we are planning to implement a proper modularisation as you are suggesting
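
For context, a slightly fuller sketch of the wrapper approach described above, with invented process, channel and script names (and, as noted, not the recommended route):

process runFastqcPipeline {
   input:
   file reads from reads_ch

   output:
   file 'results/*' into fastqc_results_ch

   // the script block just launches a second, self-contained pipeline
   """
   nextflow run fastqc.nf --reads ${reads} --outdir results
   """
}
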
Ghost
@ghost~598345d2d73408ce4f6ff925
Aug 03 2017 16:04
Ok cool just wondering, thanks :)
Jeff Jasper
@_jasper1918_twitter
Aug 03 2017 17:12

I have my pipeline working in multiple environments: local, cluster, single AWS instance. Now I want to get it working with an Ignite cluster in AWS. I'm having a hard time understanding how Ignite pairs the worker to the master node. AWS does not support multicast, so I understand this may not be straightforward. In order to get it to work I have to do these steps:

  1. launch the cluster with a config containing cluster.join = "path:/mnt/efs/joincast"
  2. ssh into each worker and mount EFS (thought this was handled w/ sharedStorageMount?) but it is only working for the master
  3. run nextflow node -bg -cluster.join path:/mnt/efs/joincast
  4. ssh into the master and run nextflow run <your pipeline> -process.executor ignite

Is this the recommended approach? Seems a bit tedious.
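
For anyone following along, a minimal nextflow.config sketch of the settings mentioned in this thread; the AMI and EFS IDs are placeholders, and the option names should be double-checked against the cloud and Ignite docs:

cloud {
   imageId = 'ami-xxxxxxxx'           // placeholder: Nextflow AMI for your region
   instanceType = 'm4.large'          // placeholder instance type
   sharedStorageId = 'fs-xxxxxxxx'    // placeholder: EFS file system to mount on every node
   sharedStorageMount = '/mnt/efs'    // optional; /mnt/efs is the default
}

cluster {
   join = 'path:/mnt/efs/joincast'    // file-based node discovery on the shared volume
}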

Félix C. Morency
@fmorency
Aug 03 2017 17:15
I don't know much about this, but have you looked at nextflow cloud?
iirc, this is the way to deploy an ignite cluster in AWS
Jeff Jasper
@_jasper1918_twitter
Aug 03 2017 17:21
I do use nextflow cloud create to start a cluster, but in my experience the resulting cluster is not ready to deploy jobs, since the master/worker nodes are not paired
Jeff Jasper
@_jasper1918_twitter
Aug 03 2017 17:24
Absolutely. The video gave me a good jump start. However, I jumped in and quickly noticed the jobs were never being distributed. The Nextflow logs suggested no workers could be discovered...
A previous post from yesterday found that a shared workDir was necessary. In my case I am using EFS. I'm not sure if 1) NF is expected to run this way, or 2) since EFS mounts are not taking place on the workers, my cluster is failing to communicate
Félix C. Morency
@fmorency
Aug 03 2017 17:32
I don't know enough to help you. @pditommaso is the guy that can help you :)
Jeff Jasper
@_jasper1918_twitter
Aug 03 2017 17:37
Appreciate the help! I'll hang tight for Paolo.
Mike Smoot
@mes5k
Aug 03 2017 17:46
@_jasper1918_twitter do you have the sharedStorageId and sharedStorageMount specified in your nextflow.config? I seem to recall having a hard time getting nextflow cloud working with my own AMI, but then having much better luck with the one @pditommaso specifies in the tutorials.
Jeff Jasper
@_jasper1918_twitter
Aug 03 2017 17:49
Hey Mike. I do specify both of those, and the EFS mount is successful on the master, just not the workers. I am using my own AMI, so definitely something to think about.
Michael L Heuer
@heuermh
Aug 03 2017 17:56
@mhalagan-nmdp may be seeing the same issue with nextflow cloud and EFS
Mike Smoot
@mes5k
Aug 03 2017 17:58
It was a while ago that I did this so I'm having trouble recalling the details, but I think I was seeing errors in the userdata log file of my worker nodes. I know at one point I also tried adding the EFS partition to my fstab, but that shouldn't be necessary. Back to the AMI - I think I was building my custom AMI from raw CentOS, which was missing a few magic AWS packages that Amazon Linux has preinstalled. So if you need a custom image, consider starting with Amazon Linux.
Paolo Di Tommaso
@pditommaso
Aug 03 2017 18:02
"1) NF is expected to run this way"
you can manage the deployment on your own, but it's more complicated
"2) Since EFS mounts are not taking place on workers, my cluster is failing to communicate"
if there's a problem you need to address it ..
Ghost
@ghost~598345d2d73408ce4f6ff925
Aug 03 2017 18:06
I have a question regarding intermediate files. Is there a good way to clean up (delete) files that are no longer needed during the workflow? Is the best way to just make a process that specifically does this task?
Félix C. Morency
@fmorency
Aug 03 2017 18:06
nextflow clean
Ghost
@ghost~598345d2d73408ce4f6ff925
Aug 03 2017 18:06
Thank you!
Paolo Di Tommaso
@pditommaso
Aug 03 2017 18:07
or rm -rf work/
Félix C. Morency
@fmorency
Aug 03 2017 18:07
^ don't do that if you're using symlinks!
Paolo Di Tommaso
@pditommaso
Aug 03 2017 18:08
true .. :)
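
For reference, a rough usage sketch of the clean command (options from memory, so double-check them with nextflow clean -h; <run_name> is a placeholder for a run name shown by nextflow log):

nextflow clean -n                      # dry run: list what would be removed
nextflow clean -f                      # remove the work dirs of the last run
nextflow clean -before <run_name> -f   # remove work dirs of runs older than <run_name>
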
Jeff Jasper
@_jasper1918_twitter
Aug 03 2017 18:10
@pditommaso - Will Nextflow mount EFS on all instances (master AND worker) if I use sharedStorageId and sharedStorageMount? I need to understand whether this is an issue related to my environment or to Nextflow's design.
Paolo Di Tommaso
@pditommaso
Aug 03 2017 18:10
yes
sharedStorageMount is optional
the default mount path is /mnt/efs
Mike Smoot
@mes5k
Aug 03 2017 18:18
@pditommaso I've run into an interesting problem. At the end of a long pipeline I've got one process that aggregates something like 100,000 files. This results in a .command.run for that process that is 22MB in size! This runs fine locally, but slurm chokes on a script that big (their limit is apparently 4MB). Ideally I'd just store all of these files in a directory somewhere and pass the directory around, but I'm not quite sure what the best way to do that would be. If I pass all of the files into a process and return a directory then I have the same problem, but if I store things from a channel (maybe using reduce) then I'm not sure where to put the directory... Any ideas?
Félix C. Morency
@fmorency
Aug 03 2017 18:20
@mes5k can't you use like.. wildcards? are you passing the whole file list to the .command.run?
Paolo Di Tommaso
@pditommaso
Aug 03 2017 18:22
OH !! with each new fix you push it to a new limit ! :)
what does it contain? 100k symlink creations ?!
Mike Smoot
@mes5k
Aug 03 2017 18:23
Yes
Paolo Di Tommaso
@pditommaso
Aug 03 2017 18:24
hence 100k upstream tasks ?
Mike Smoot
@mes5k
Aug 03 2017 18:25
Probably a bit less, but some tasks produce many output files.
But yes, this is a big pipeline.
Paolo Di Tommaso
@pditommaso
Aug 03 2017 18:26
so output the directories instead of each single file
just posting if somebody is interested ..
Mike Smoot
@mes5k
Aug 03 2017 18:31
Yeah, some sort of upstream aggregation is probably the answer, but it's not immediately clear that we can do that. I'll see what I can do. First I think I'll try a reduce that returns a dir...
Paolo Di Tommaso
@pditommaso
Aug 03 2017 18:32
you said you have a process that outputs many files
can you not create a folder in the process, mv the files there, then output it ?
Mike Smoot
@mes5k
Aug 03 2017 18:33
I do, but a lot of other processing happens to those files, so a change there would ripple through a lot of the DAG.
Paolo Di Tommaso
@pditommaso
Aug 03 2017 18:33
I see, are they text files or binary files?
Mike Smoot
@mes5k
Aug 03 2017 18:34
The process that generates the files is also already part of a complicated batching operation because one process per file is WAY too much overhead, even though the code was much cleaner.
They're fasta and gff files
Paolo Di Tommaso
@pditommaso
Aug 03 2017 18:36
what about collecting all the paths to a single text file, then processing them by reading that file ?
provided you don't have another way to aggregate those fasta and gff somehow
Mike Smoot
@mes5k
Aug 03 2017 18:43
I'm actually already doing that! The python script that was processing all of these files couldn't handle that much input on the command line, so it now reads a YAML file with my long list of files. However, what I'm doing is just listing the file name instead of the full path. I think I should be able to list the full path instead... Maybe that's it!
Paolo Di Tommaso
@pditommaso
Aug 03 2017 18:46
if you are collecting the files into a YAML, why do you have such a big wrapper script? the input should be only one ..
Mike Smoot
@mes5k
Aug 03 2017 18:50
I was just writing the file names into the YAML and not the full paths and then passing the files in like normal. Probably not a great idea... :)
Paolo Di Tommaso
@pditommaso
Aug 03 2017 18:50
:)
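
Pulling the thread above together, a rough DSL1-style sketch of the manifest approach being discussed; the channel, process and script names are invented for illustration:

result_files
   .map { f -> f.toString() }                        // absolute path of each upstream output
   .collectFile(name: 'manifest.txt', newLine: true) // one path per line, single output file
   .set { manifest_ch }

process aggregate {
   input:
   file manifest from manifest_ch

   output:
   file 'aggregated.txt'

   """
   aggregate.py --manifest ${manifest} > aggregated.txt
   """
}

Because only the manifest is staged, the generated .command.run no longer has to create one symlink per input file.
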
Michael Halagan
@mhalagan-nmdp
Aug 03 2017 21:00

I was having issues with running a nextflow cloud cluster, but I got it working after @pditommaso mentioned the need for an EFS.

Here are the four things I did to get a nextflow cloud cluster working:

  • Used a copied version of the Nextflow AMI from EU (Ireland)
  • Made sure the AWS VPC's "DHCP Options Set" resolves to "ec2.internal".
  • Set up an AWS EFS and provided its ID as the sharedStorageId parameter in the nextflow.config file.
  • Made sure the security group allows the master node to access the child nodes and the EFS specified.

I think this functionality makes nextflow extremely useful.

@pditommaso When creating a nextflow cluster (nextflow cloud create clustername -c 3), is there a way to programmatically answer 'y'?
Please confirm you really want to launch the cluster with above configuration [y/n]
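
A generic shell workaround, not confirmed in this thread, is to pipe the answer into the command's stdin (assuming the prompt reads from standard input):

echo y | nextflow cloud create clustername -c 3
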
Jeff Jasper
@_jasper1918_twitter
Aug 03 2017 21:22
Very helpful, @mhalagan-nmdp. Once I've got my cluster scheme well defined, maybe we can compare notes.