These are chat archives for nextflow-io/nextflow

14th
Mar 2016
Robert Syme
@robsyme
Mar 14 2016 04:24
@pditommaso I've just tried the Apache Ignite executor - amazing! I didn't expect that it would honour the memory and cpus directives. So nice. Thanks!
Paolo Di Tommaso
@pditommaso
Mar 14 2016 07:23
@robsyme Still not perfect, but it does!
The next release will include a work stealing strategy that should improve the job allocation for some workloads. Stay tuned.
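For reference, the memory and cpus directives mentioned above are declared per process. A minimal sketch in the same 2016-era DSL used elsewhere in this chat; the process name and the resource values are illustrative assumptions, not taken from Robert's setup:

```nextflow
process resource_demo {
  // Illustrative values only: request 2 CPUs and 1 GB of memory per task
  cpus 2
  memory '1 GB'

  output:
  stdout into demo_out

  """
  echo -n 'Requested CPUs:' ${task.cpus}
  """
}

demo_out.println()
```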
Robert Syme
@robsyme
Mar 14 2016 07:29
I've spun up a test cluster of four nodes (three workers, one master) with an NFS share at /data. Given the workflow:
numbers = Channel.from(1..10)

process test {
  input:
  val(number) from numbers

  output:
  stdout into debug

  """
sleep 10
echo -n 'Hi there,' ${number}
"""
}

debug.println()
I would expect that three jobs run at the same time, but the timeline looks like:
timeline.html
Paolo Di Tommaso
@pditommaso
Mar 14 2016 07:31
um, curious
can you upload the .nextflow.log somewhere?
Robert Syme
@robsyme
Mar 14 2016 07:33
ubuntu@server-bcc04a54-d969-4d30-bed9-9d13fe49763a:/data$ tree -a cluster/
cluster/
├── 0:0:0:0:0:0:0:1%lo#47500
├── 127.0.0.1#47500
├── 130.56.252.226#47500
├── 130.56.252.227#47500
└── 130.56.252.23#47500
Log
Paolo Di Tommaso
@pditommaso
Mar 14 2016 07:35
It looks like the nodes are not able to discover each other
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=c44de6ee, name=nextflow]
    ^-- H/N/C [hosts=1, nodes=1, CPUs=1]
    ^-- CPU [cur=0.67%, avg=1.98%, GC=0%]
    ^-- Heap [used=147MB, free=84.57%, comm=223MB]
    ^-- Public thread pool [active=1, idle=15, qSize=0]
    ^-- System thread pool [active=0, idle=16, qSize=0]
    ^-- Outbound messages queue [size=0]
the third line says hosts=1, nodes=1, CPUs=1
Robert Syme
@robsyme
Mar 14 2016 07:37
Ah, so I need to open up ports between the master and each of the workers?
Paolo Di Tommaso
@pditommaso
Mar 14 2016 07:37
if you are using the default discovery mechanism based on tcp multicast, yes
what is the cluster folder you posted above?
Robert Syme
@robsyme
Mar 14 2016 07:40
That's the folder given to the -cluster.join path: argument.
Wait, I think I know what I did wrong...
Paolo Di Tommaso
@pditommaso
Mar 14 2016 07:41
therefore port 47500 should be open in any case
Robert Syme
@robsyme
Mar 14 2016 07:41
Just checking...
Paolo Di Tommaso
@pditommaso
Mar 14 2016 07:46
anyway, your command line has no cluster join option: nextflow run -process.executor ignite -with-timeline timeline.html .
Robert Syme
@robsyme
Mar 14 2016 07:47
That's better
Paolo Di Tommaso
@pditommaso
Mar 14 2016 07:47
yes!
Robert Syme
@robsyme
Mar 14 2016 07:47
Two things:
1) I forgot to add the cluster join option after switching from multicast to shared file system method
2) The write permissions to the shared file system were screwy
Thanks!
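A sketch of how the shared file system discovery could be pinned in configuration rather than typed on each command line. It assumes that the join setting is read from a cluster scope in nextflow.config in the same way as the -cluster.join command line option, which is worth checking against the Ignite executor documentation for your version; /data/cluster is the folder from the listing above:

```nextflow
// nextflow.config (sketch, not verified against a specific release)
process.executor = 'ignite'

cluster {
  // Shared folder where each node registers its address;
  // it must be writable by every node
  join = 'path:/data/cluster'
}
```

The worker nodes need the same join setting, and the discovery port (47500 in the listing) must be reachable between them.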
Paolo Di Tommaso
@pditommaso
Mar 14 2016 07:48
ah
welcome!
Jason Byars
@jbyars
Mar 14 2016 18:27
does sufficient status/log data exist while a run is running to generate the execution timeline graphic in ~realtime?
Paolo Di Tommaso
@pditommaso
Mar 14 2016 19:03
@jbyars Yes. The main reason it is rendered when the pipeline completes is the hierarchical nature of HTML. But it could be rendered in real time using a more sophisticated approach.
Jason Byars
@jbyars
Mar 14 2016 19:21
Thank you. Would there be any objection to creating a Jenkins plugin to do?
Jason Byars
@jbyars
Mar 14 2016 19:33
To do a simple realtime plot of a pipeline while it is running?
Paolo Di Tommaso
@pditommaso
Mar 14 2016 19:41
Not at all, any contribution is very welcome
Actually I'm also interested in your use case. How are you using nextflow with Jenkins?
Jason Byars
@jbyars
Mar 14 2016 19:42
Jenkins is my go to for building code, and to me most bioinformatics problems are just a code build with a few TB of data thrown in the mix.
Paolo Di Tommaso
@pditommaso
Mar 14 2016 19:44
so, are you using it to launch your pipelines?
Jason Byars
@jbyars
Mar 14 2016 19:44
yes, launch pipelines, do housekeeping, etc.
Paolo Di Tommaso
@pditommaso
Mar 14 2016 19:45
I see, nice
Jason Byars
@jbyars
Mar 14 2016 19:47
it allows reusable scripts, pipeline chaining, and a plethora of plugins to keep the amount of scripting down.
However, nextflow provides a much more appropriate DSL to describe my work than the Jenkins Workflow DSL.
Jason Byars
@jbyars
Mar 14 2016 19:55
switching job schedulers would be painful, and heredoc-wrapping PBS jobs doesn't create the most readable job scripts
Paolo Di Tommaso
@pditommaso
Mar 14 2016 19:56
Indeed, Jenkins is fine for build workflows but, as you said, it is not the right tool for computational pipelines
Jason Byars
@jbyars
Mar 14 2016 20:02
exactly, but if you leave the housekeeping to Jenkins and describe the pipelines with nextflow you have something fairly nice to use. The two things to keep in mind with my use case are these. First, we are doing IonTorrent sequencing, so samples tend to trickle in over time rather than all at once, and a pipeline that kicks off whenever new data shows up is attractive. Second, for most of the processing we do, I'm just passing file names between processes.
Paolo Di Tommaso
@pditommaso
Mar 14 2016 20:05
Interesting, if you could blog about that, it would be great
Jason Byars
@jbyars
Mar 14 2016 20:08
I'm trying to share an example pipeline. How are people getting images to paste? I keep getting a "problem connecting to upload server" error
Paolo Di Tommaso
@pditommaso
Mar 14 2016 20:08
what images? what server?
Jason Byars
@jbyars
Mar 14 2016 20:09
I'm trying to paste an image in the chat.
Paolo Di Tommaso
@pditommaso
Mar 14 2016 20:10
Drag & drop should work
Jason Byars
@jbyars
Mar 14 2016 20:10
yes it probably should, but doesn't at the moment.
Paolo Di Tommaso
@pditommaso
Mar 14 2016 20:10
:)
Anyway, you should also give watchPath a try
it allows you to process data in a continuous manner as it is generated
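A minimal sketch of that idea, in the same DSL as the workflow posted earlier in this chat. The watched directory, the glob pattern and the process name are hypothetical; only Channel.watchPath itself comes from the discussion:

```nextflow
// Emit each new file matching the (hypothetical) glob as it is created
incoming = Channel.watchPath('/data/incoming/*.bam', 'create')

process announce {
  input:
  file(bam) from incoming

  output:
  stdout into seen

  """
  echo -n 'New file:' ${bam}
  """
}

seen.println()
```

Because the channel keeps watching the directory, a pipeline like this does not finish on its own; it runs until it is stopped.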
Jason Byars
@jbyars
Mar 14 2016 20:12
pipeline.png
ahh apparently the answer was firefox
Paolo Di Tommaso
@pditommaso
Mar 14 2016 20:13
What Jenkins plugin is this?
Jason Byars
@jbyars
Mar 14 2016 20:14
I still haven't worked through the Channel Factory material, but you're right, watchPath probably is a good fit. That's an example of the older Jenkins build pipeline plugin. This will be a little confusing: Jenkins recently renamed all of their Workflow DSL plugins to "pipeline" as well.
Paolo Di Tommaso
@pditommaso
Mar 14 2016 20:16
I'm not up to date with the latest Jenkins developments :)
Jason Byars
@jbyars
Mar 14 2016 20:18
The visualizations of their Workflow DSL are done with the recently released Stage View plugin. It really doesn't deal with branching and other concepts that we might use.
You haven't missed much. My take is right UI concepts, but wrong DSL for bioinformatics jobs. Maybe with a small amount of adaptation I can fix that.
Paolo Di Tommaso
@pditommaso
Mar 14 2016 20:23
That would be great, however proper workflow visualisation could be a really tricky task
Jason Byars
@jbyars
Mar 14 2016 20:24
The cross over for watchPath is the FSTrigger plugin.
Paolo Di Tommaso
@pditommaso
Mar 14 2016 20:26
yep
Jason Byars
@jbyars
Mar 14 2016 20:29
the catch with the FSTrigger plugin is you have to be careful it doesn't inspect larger files. Otherwise your server will just sit around md5 hashing your bam files all day. Does watchPath look at anything beyond the timestamp?
Paolo Di Tommaso
@pditommaso
Mar 14 2016 20:31
watchPath relies on the Java file system watch service
there's no such problem
Jason Byars
@jbyars
Mar 14 2016 20:32
I agree workflow visualization is hard. As soon as you get past simple examples, the diagrams usually turn into the flying spaghetti monster.
Paolo Di Tommaso
@pditommaso
Mar 14 2016 20:32
even though I don't know how it works behind the scenes
As soon as you get past simple examples, the diagrams usually turn into the flying spaghetti monster.
that's the problem
to make it really functional for big workflows requires a smart approach and a lot of work
Jason Byars
@jbyars
Mar 14 2016 20:34
the only thing I've found that helps over the years is forcing people to provide some context so I can decide what is a priority and what can be simplified.
Paolo Di Tommaso
@pditommaso
Mar 14 2016 20:35
makes sense
Jason Byars
@jbyars
Mar 14 2016 20:37
for my needs, for example, I usually just need that high level view I pasted. I just want to see what processes are running, which stage blew up, and where the logs are for the stage that blew up.
Paolo Di Tommaso
@pditommaso
Mar 14 2016 20:38
we have proposed a google idea that more or less would include what you are saying
The only missing part is a student to implement it ;)
Jason Byars
@jbyars
Mar 14 2016 20:38
interesting, are you working with the Arvados guys on this?
Paolo Di Tommaso
@pditommaso
Mar 14 2016 20:39
Not directly, but maybe we could in the future
Jason Byars
@jbyars
Mar 14 2016 20:44
Interesting. You'll probably want to wrap Aspera's ascp into the DSL to fetch the actual data. Then it would be a simple matter of solving for URL.
Paolo Di Tommaso
@pditommaso
Mar 14 2016 20:45
The problem with Aspera things is that they have a very restrictive license model
Jason Byars
@jbyars
Mar 14 2016 20:45
So associated with the study id would be one or many nextflow flows, and each of those would have a list of required docker containers available on public hubs?
Paolo Di Tommaso
@pditommaso
Mar 14 2016 20:45
Is there any Java client for that?
So associated with the study id would be one or many nextflow flows, and each of those would have a list of required docker containers available on public hubs?
Jason Byars
@jbyars
Mar 14 2016 20:46
Agreed, now that IBM owns Aspera, all licensing is a headache.
Paolo Di Tommaso
@pditommaso
Mar 14 2016 20:46
It depends on the user.
Nextflow can pull a different container image for each task if needed
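As an illustration, a sketch of two processes that each declare their own image through the container directive. The image names are placeholders rather than real images, and the sketch assumes Docker support is switched on (for example with docker.enabled = true in nextflow.config):

```nextflow
process align {
  container 'example/aligner:1.0'   // placeholder image name

  """
  echo 'running inside the aligner image'
  """
}

process call_variants {
  container 'example/caller:2.3'    // placeholder image name

  """
  echo 'running inside the caller image'
  """
}
```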
New FASPStream API supporting high speed, predictable WAN based transport of byte stream ("live") data via software libraries and a FASPStream binary (Windows, Linux) and optional management APIs for Java and .NET
Jason Byars
@jbyars
Mar 14 2016 20:49
yes, that's the protocol. ascp is their free command line client.
All I'm suggesting is DSL runs the command line client and the nextflow user preinstalls the client.
Paolo Di Tommaso
@pditommaso
Mar 14 2016 20:53
that could be done, though it would introduce a dependency on the Aspera client, breaking the portability of the pipeline
Jason Byars
@jbyars
Mar 14 2016 20:54
the feature is the ability to specify SRA study or SRA samples and not have to work out the url. Agreed that would be a minor portability break.
If you just want permission to have nextflow download and use the ascp client, that might be a concise enough request, to get a clear answer from IBM's reps. They seem to be very reasonable about academic requests.
Paolo Di Tommaso
@pditommaso
Mar 14 2016 20:58
that sounds reasonable, I will take it into consideration
Jason Byars
@jbyars
Mar 14 2016 20:58
if getting permission becomes a hassle, I completely understand.
the perk is pulling data into AWS with ascp is really fast and really easy... if you can guess the correct url
Paolo Di Tommaso
@pditommaso
Mar 14 2016 21:03
Are you referring to the SRA database or to generic data?
Jason Byars
@jbyars
Mar 14 2016 21:26
I'm referring to reads from the SRA database
I haven't had an opportunity to pull data from other Aspera servers.
Paolo Di Tommaso
@pditommaso
Mar 14 2016 21:36
Nextflow could try to use the ascp client if installed, or fall back to HTTP otherwise.
Jason Byars
@jbyars
Mar 14 2016 21:39
I think that would be a good approach
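The fallback Paolo describes would live inside Nextflow itself; the same idea can be sketched at the pipeline level with a process that checks for the client at run time. Everything specific here is an assumption: the three params are placeholders to be filled in, and the ascp flags (-i for the key file, -k 1 to resume, -T to disable encryption) should be checked against the installed client's documentation:

```nextflow
params.aspera_key = '/path/to/aspera/key'                   // placeholder
params.aspera_src = 'user@host:/path/to/reads.sra'          // placeholder
params.http_url   = 'http://example.org/path/to/reads.sra'  // placeholder

process fetch_reads {
  output:
  file 'reads.sra' into reads

  """
  if command -v ascp >/dev/null 2>&1; then
    # Aspera client found on the PATH: use the fast transfer
    ascp -i ${params.aspera_key} -k 1 -T ${params.aspera_src} reads.sra
  else
    # No client installed: fall back to a plain HTTP download
    curl -fsSL -o reads.sra ${params.http_url}
  fi
  """
}
```

The pipeline still runs on machines without the Aspera client, which addresses the portability concern raised earlier, at the cost of a slower download.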