These are chat archives for nextflow-io/nextflow

1st Sep 2016
Mike Smoot
@mes5k
Sep 01 2016 15:16

Hi @pditommaso I've got a simple process that keeps failing because the expected output file is missing:

process gzipem {
    publishDir path: "results", mode: "copy", overwrite: true

    input:
    file(json) from json_files

    output:
    file("${json}.gz") into gzipped_files

    script:
    """
    gzip -c ${json} > ${json}.gz
    """
}

When I look in the work dir, I see the output file. I thought this might be an NFS issue, but it fails in the same way when run on a local disk. The input files aren't huge, so the process runs pretty quickly. I've tried setting errorStrategy to retry and maxRetries to 2 and separately I've tried adding a sleep 10 after the gzip command and still I get failures. Any other ideas on how I might address this?

Paolo Di Tommaso
@pditommaso
Sep 01 2016 15:18
does the (POSIX) process exit normally?
Mike Smoot
@mes5k
Sep 01 2016 15:21
yes, exit 0
Paolo Di Tommaso
@pditommaso
Sep 01 2016 15:22
that doesn't make much sense
are u using docker with this?
Mike Smoot
@mes5k
Sep 01 2016 15:23
No, although the config file I copied in has docker enabled. Let me turn that off and try again.
didn't seem to help
Paolo Di Tommaso
@pditommaso
Sep 01 2016 15:26
have you tried touching a file after the gzip and capturing that instead of the other one?
Mike Smoot
@mes5k
Sep 01 2016 15:27
do you mean gzip -c ${json} > ${json}.gz && touch ${json}.gz?
I can try
Paolo Di Tommaso
@pditommaso
Sep 01 2016 15:27
just another file name
Mike Smoot
@mes5k
Sep 01 2016 15:30
touching either json.gz or another file name didn't seem to help. One thing I'm noticing is that for each failure, the file name in question has square brackets in it. For instance, this is one of the input files:
PUBLIC__Microbial__Viruses__Satellites__Satellite_Nucleic_Acids__Single_stranded_DNA_satellites__Betasatellites__Siegesbeckia_yellow_vein_virus-[GD13]-associated_DNA_beta__AM230643.1.json
Paolo Di Tommaso
@pditommaso
Sep 01 2016 15:31
ummmmm, I don't like those square brackets in the file name
Mike Smoot
@mes5k
Sep 01 2016 15:31
Yeah, every failure I've noticed seems to have those.
Let me see if fixing that addresses my problem.
Paolo Di Tommaso
@pditommaso
Sep 01 2016 15:31
can you rename it to file1.gz just to try?
Mike Smoot
@mes5k
Sep 01 2016 15:33
sure
Well, it seemed to be the filename!
Paolo Di Tommaso
@pditommaso
Sep 01 2016 15:35
eh
Mike Smoot
@mes5k
Sep 01 2016 15:35
I can now run to completion.
Paolo Di Tommaso
@pditommaso
Sep 01 2016 15:35
:)
you may want to open an issue for that .. but those names ....
:)
Mike Smoot
@mes5k
Sep 01 2016 15:37
Sounds good. I'll open a ticket. The names are generated, but I can certainly do a better job tidying them up!
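One possible tidy-up, sketched in the same DSL as the process above (the `replaceAll` pattern and the `safe_json_files` channel name are assumptions for illustration, not the fix Mike actually applied):

```nextflow
// Hypothetical sketch: pair each file with a sanitized name so that
// glob metacharacters ([ and ]) never appear in the declared output name.
json_files
    .map { json ->
        def safe = json.name.replaceAll(/[\[\]]/, '_')  // replace the brackets
        tuple(safe, json)
    }
    .set { safe_json_files }

process gzipem {
    publishDir path: "results", mode: "copy", overwrite: true

    input:
    set val(safe), file(json) from safe_json_files

    output:
    file("${safe}.gz") into gzipped_files

    script:
    """
    gzip -c ${json} > ${safe}.gz
    """
}
```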
Paolo Di Tommaso
@pditommaso
Sep 01 2016 15:47
Announcing a new nextflow milestone
Evan Floden
@evanfloden
Sep 01 2016 15:52
🍾🍾🍾🍾🍾
Mike Smoot
@mes5k
Sep 01 2016 15:55
Very, very cool!
Paolo Di Tommaso
@pditommaso
Sep 01 2016 15:56
Thanks!
BTW it also includes the new log and clean commands
Mike Smoot
@mes5k
Sep 01 2016 15:57
Yeah, big release! I hope to experiment with this shortly.
Paolo Di Tommaso
@pditommaso
Sep 01 2016 15:57
Looking forward to hearing your comments
Mike Smoot
@mes5k
Sep 01 2016 17:28
Hi @pditommaso is there support for the queue directive with Apache Ignite? Can I create an ignite cluster with different machines dedicated to different tasks?
Paolo Di Tommaso
@pditommaso
Sep 01 2016 17:29
no, there's no queue concept with Ignite; however you can keep each cluster instance separate
is it a cloud cluster or on-premises ?
Mike Smoot
@mes5k
Sep 01 2016 17:32
Can I have one nextflow pipeline span multiple cluster instances? Initially this is just a pretend "cluster" on some local machines to test things out, but ultimately in the cloud. Maybe even using nextflow cloud :)
Paolo Di Tommaso
@pditommaso
Sep 01 2016 17:33
no, each nextflow run spawns its own cluster instance
is this fine ?
Mike Smoot
@mes5k
Sep 01 2016 17:34
What I want is for all my blast tasks to run on a particular machine tuned for running blast, and other tasks to run elsewhere.
Paolo Di Tommaso
@pditommaso
Sep 01 2016 17:35
ahh
Mike Smoot
@mes5k
Sep 01 2016 17:35
I assume this would work with SLURM or PBS or some such using the queue directive.
Paolo Di Tommaso
@pditommaso
Sep 01 2016 17:36
currently you can only bind the execution to some computing resources, i.e. cpus, mem and disk
the problem with the queue is that it's a concept that requires a persistent allocation of resources
instead nextflow spawns a transient cluster
in the cloud it could make sense to tag some instances and then specify those tags in the process requirements
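The resource bindings Paolo mentions are ordinary process directives; a minimal sketch (the `blast` process name, `queries` channel, and blastp command line are illustrative only):

```nextflow
// Bind a process to compute resources via directives:
process blast {
    cpus 8
    memory '16 GB'
    disk '100 GB'

    input:
    file(query) from queries

    script:
    """
    blastp -num_threads ${task.cpus} -query ${query} -db nr -out result.txt
    """
}
```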
Mike Smoot
@mes5k
Sep 01 2016 17:42
That's basically what I'd been thinking. I was assuming that ignite's "groups" could be used to direct tasks to specific places.
Paolo Di Tommaso
@pditommaso
Sep 01 2016 17:45
the problem is that the ignite daemon needs exclusive access to the node
otherwise you will have a conflict in the resource allocation
Mike Smoot
@mes5k
Sep 01 2016 17:47
Reading a bit here: http://apacheignite.gridgain.org/v1.1/docs/cluster-groups I understand that ignite isn't configured by default to do what I've been imagining, but I wonder if there's a way to use tags to accomplish this?
Paolo Di Tommaso
@pditommaso
Sep 01 2016 17:49
sounds interesting, the group could be mapped as a queue
defining it in the config file
I mean, defining what host(s) make up the group
Mike Smoot
@mes5k
Sep 01 2016 17:51
exactly!
Paolo Di Tommaso
@pditommaso
Sep 01 2016 17:51
nice idea
Evan Floden
@evanfloden
Sep 01 2016 20:05

Nextflow operator question. I have some unexpected behaviour using phase(). In the following code snippet (run with nextflow console), I would expect phase to return 3 values, but it seems one is missing.

Code snippet

Channel
    .from(['datasetA', '10', 'random', "alignmentX 1\nalingmentX 2\nalignmentK 1\nalignmentK 1"], ['datasetB', '5', 'distant', "alignmentK 1\nalingmentK 2\nalignmentK 3"])
    .set{requiredStrapTrees}

println("splitPhylips:") 
Channel
    .from(['datasetA', 'alignmentX', '1', 'fileX'], ['datasetA', 'alignmentX', '2', 'fileC'], ['datasetB', 'alignmentK', '1', 'fileN'] )
    .view()
    .set { splitPhylips }

def splitListFile(file) {
    file.readLines().findAll {it}.collect {line -> line.tokenize(' ')}
}   

println("uniqueRequiredStrapTrees:")
requiredStrapTrees
  .map { set ->
    def datasetID = set[0]
    def file = set[3]
    splitListFile(file).collect { item -> tuple(datasetID, item[0], item[1]) }
  } 
  .flatMap {item -> item }
  .unique()
  .view()
  .set{uniqueRequiredStrapTrees}

uniqueRequiredStrapTrees
  .phase(splitPhylips) {item -> [item[0],item[1],item[2]] }
  .view()
  .map {item -> [item[0][0], item[0][1], item[0][2], item[1][3]]}

returns:

splitPhylips
[datasetA, alignmentX, 1, fileX]
[datasetA, alignmentX, 2, fileC]
[datasetB, alignmentK, 1, fileN]
uniqueRequiredStrapTrees
[datasetA, alignmentX, 1]
[datasetA, alingmentX, 2]
[datasetA, alignmentK, 1]
[datasetB, alignmentK, 1]
[datasetB, alingmentK, 2]
[datasetB, alignmentK, 3]
[[datasetA, alignmentX, 1], [datasetA, alignmentX, 1, fileX]]
[[datasetB, alignmentK, 1], [datasetB, alignmentK, 1, fileN]]

I would expect [[datasetA, alignmentX, 2], [datasetA, alignmentX, 2, fileC]] to also be emitted.

Paolo Di Tommaso
@pditommaso
Sep 01 2016 20:06
(a few secs)
complex stuff man !
Evan Floden
@evanfloden
Sep 01 2016 20:11
From the documentation, I would think that in defining .phase(splitPhylips) {item -> [item[0],item[1],item[2]] } the synchronisation occurs using these 3 elements between each item in the target and source channels
Paolo Di Tommaso
@pditommaso
Sep 01 2016 20:13
I think the problem is that splitPhylips produces less items than uniqueRequiredStrapTrees
Evan Floden
@evanfloden
Sep 01 2016 20:14
Yep
Paolo Di Tommaso
@pditommaso
Sep 01 2016 20:15
phase expects a one-to-one mapping between items in two different channels
makes sense ?
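The one-to-one pairing Paolo describes can be sketched like this (the channel contents here are invented for illustration; by default phase keys on the first element of each item):

```nextflow
ch1 = Channel.from( ['a', 1], ['b', 2], ['c', 3] )
ch2 = Channel.from( ['b', 'x'], ['a', 'y'] )

// phase pairs items whose keys match across the two channels;
// ['c', 3] has no partner in ch2, so it is never emitted
ch1.phase(ch2).view()
```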
Evan Floden
@evanfloden
Sep 01 2016 20:16
but interestingly, cross gives the same result
Sorry, retract that. So should I put in a dummy element so they are 4 v 4 elements?
Paolo Di Tommaso
@pditommaso
Sep 01 2016 20:19
um, it looks like uniqueRequiredStrapTrees has the same key more than once
datasetA, datasetA, datasetB, datasetB, etc
right?
Evan Floden
@evanfloden
Sep 01 2016 20:20
No, that's my point
when I define phase(splitPhylips) {item -> [item[0],item[1],item[2]] } , the key becomes [item[0],item[1],item[2]]
but obviously not ;)
I understood that the key is defined as such
Paolo Di Tommaso
@pditommaso
Sep 01 2016 20:22
not getting your point
Evan Floden
@evanfloden
Sep 01 2016 20:25
Sorry, I thought the match was performed using [item[0],item[1],item[2]]. It is, but [item[0],item[1],item[2]] must be the first element of the target
Not simply [item[0],item[1],item[2]] in the target channel, but the first element must equal [item[0],item[1],item[2]]
I fixed it by changing the definition of splitPhylips, to make the first element the same as the key
Channel
    .from(['datasetA', 'alignmentX', '1', 'fileX'], ['datasetA', 'alignmentX', '2', 'fileC'], ['datasetB', 'alignmentK', '1', 'fileN'] )
    .map {item -> [[item[0],item[1],item[2]], item[0], item[1],item[2]]}
    .view()
    .set { splitPhylips }
Talking to the teddy bear! Getting toooo deep. Thanks mate
Paolo Di Tommaso
@pditommaso
Sep 01 2016 20:28
great, you are going like a pro!
:)