These are chat archives for nextflow-io/nextflow

14th
Feb 2019
Paolo Di Tommaso
@pditommaso
Feb 14 06:49 UTC
@hydriniumh2 regarding the checksum, have a look also at this tweet https://twitter.com/delagoya/status/1095731459929718785
therefore I would say that downloads are also safe
rfenouil
@rfenouil
Feb 14 08:33 UTC
Good morning, I cannot find a way to assign several elements to a params.something variable from the command line using --something=valueA valueB (and get an array).
A single value works. Is that possible?
Maxime Garcia
@MaxUlysse
Feb 14 08:35 UTC
@rfenouil You would want to do something like --something foo,bar ?
rfenouil
@rfenouil
Feb 14 08:35 UTC
Oh, the comma is what I need?
Thank you, I'll try right away :)
Maxime Garcia
@MaxUlysse
Feb 14 08:35 UTC
not only
you'll need to trim your params in your script as well
something = params.something ? params.something.split(',').collect{it.trim()} : []
that way if it's only one, you'll get a one-element list; if there are more, you'll get them all
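For reference, a quick sketch of what that one-liner gives you (the --something value here is just an example, not from the original message):

// e.g. run with: nextflow run main.nf --something 'foo, bar'
params.something = 'foo, bar'
something = params.something ? params.something.split(',').collect{ it.trim() } : []
println something   // prints [foo, bar]; a single value such as --something foo gives [foo]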
rfenouil
@rfenouil
Feb 14 08:37 UTC
owww... So NF does not handle multiple arguments natively?
Maxime Garcia
@MaxUlysse
Feb 14 08:38 UTC
Not sure
NF can handle globs within an argument
but I don't think it can natively do it with multiple arguments
But I could totally be wrong
rfenouil
@rfenouil
Feb 14 08:40 UTC
Yep, globs are nice for files but not for arbitrary values.
Because of the native glob handling, I actually assumed that it would be possible to provide several values to 'regular' arguments.
Thank you very much Maxime, I'll go with your suggestion.
Maxime Garcia
@MaxUlysse
Feb 14 08:41 UTC
If there's something smarter, I'm sure Paolo will correct me soon enough ;-)
rfenouil
@rfenouil
Feb 14 08:41 UTC
I'll keep an eye on my gitter notifications, thank you ;)
Toni Hermoso Pulido
@toniher
Feb 14 09:46 UTC
Hello, I'm trying to retrieve the full parent directory from a file, normally not a problem in most setups
db_name = file(params.blastDB_path).name
db_path = file(params.blastDB_path).parent
however on one computer, I guess because of symbolic links, I only get part of the full path: let's say from /nfs/db/ncbi/201810/blastdb/db/nr, I don't get /nfs/db/ncbi/201810/blastdb/db, but just db
any idea?
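For illustration, this is what the two calls above are expected to return on a path like the one described (the path is taken from the message, not a real setup):

db = file('/nfs/db/ncbi/201810/blastdb/db/nr')
println db.name     // nr
println db.parent   // /nfs/db/ncbi/201810/blastdb/db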
Toni Hermoso Pulido
@toniher
Feb 14 09:54 UTC
solved, I was messing up with some input below
rfenouil
@rfenouil
Feb 14 09:55 UTC
Auto-resolving issues are the best :)
Maxime Garcia
@MaxUlysse
Feb 14 09:59 UTC
:bird:
rfenouil
@rfenouil
Feb 14 10:02 UTC
Quick question again:
I have a process that requires one file as a regular input, and optionally a second one in a separate argument.
For the optional file, when the user does not give one, I remember @pditommaso recommended putting a 'dummy file object' in the channel (to maintain execution of the process).
That required checking the file value in the process block and replacing the argument accordingly.
I wonder if something less convoluted is now possible with more recent versions of NF?
rfenouil
@rfenouil
Feb 14 10:13 UTC
Here is (roughly) how I handle it currently:
NO_FILE = file('DUMMY_FILE')

// Default value
params.optionalFile=false


// Initialize channel with DUMMY object for optional file
Channel.from(NO_FILE).set{ ch_optionalFile_forDE }

// Override channel value if optional file specified
if(params.optionalFile) Channel.fromPath(params.optionalFile).set{ ch_optionalFile_forDE }

process DE {

    input:
        file inputFile        from ch_inputFile_forDE // don't care about this one in this example
        file optionalFile     from ch_optionalFile_forDE.collect()

    output:
        ...

    script:
        // use "NULL" when the dummy placeholder was staged instead of a real file
        optionalFile = optionalFile.name == NO_FILE.name ? "NULL" : optionalFile
        """
        DE.R --input $inputFile \
             --optFile $optionalFile
        """
}
I copied this example from memory so it may not be exact, but the idea is here
rfenouil
@rfenouil
Feb 14 10:20 UTC
I suspect there is a much better way to do this. I tried to play with empty channels but that systematically prevents the process from executing.
I like the 'optional' keyword that can be added to output lines; would it make sense to implement it for input lines too?
Maxime Garcia
@MaxUlysse
Feb 14 10:23 UTC
I think the issue with adding that for input is that it would complicate the DAG even more
In your case, I would try to simplify the script with something like:
// Default value
params.optionalFile=false

ch_optionalFile_forDE = params.optionalFile ? Channel.fromPath(params.optionalFile) : Channel.from(file('NO_FILE'))
process DE {

    input:
        file inputFile        from ch_inputFile_forDE
        file optionalFile     from ch_optionalFile_forDE

    output:

    script:
        optionalFile = optionalFile.name == 'NO_FILE' ? "NULL" : optionalFile
        """
        echo --input $inputFile \
             --optFile $optionalFile
        """
}
which is of course, much nicer than mine
rfenouil
@rfenouil
Feb 14 10:28 UTC
Yes, thank you for your corrections/optimizations
Meh, I looked into patterns but missed it... Coffee required :)
So yeah, I guess the answer is that we still need to use a dummy object as a 'flag' in the channel.
Maxime Garcia
@MaxUlysse
Feb 14 10:30 UTC
Yes, I'm afraid so
rfenouil
@rfenouil
Feb 14 10:31 UTC
Ok thank you again Maxime, you saved my day :)
Maxime Garcia
@MaxUlysse
Feb 14 10:32 UTC
don't mention it
rfenouil
@rfenouil
Feb 14 10:33 UTC
Next time you see Paolo, can you buy him a beer and suggest the optional keyword for inputs? I'll refund you two beers :D
Maxime Garcia
@MaxUlysse
Feb 14 10:33 UTC
lool
Mathias Walzer
@mwalzer
Feb 14 11:32 UTC
Hi there!

I am having trouble connecting a non-AWS S3 bucket as publishDir. I use publishDir 's3://mybucket/' and have a client endpoint configured:

profiles {
   myprofile {
       docker.enabled = false
       singularity.enabled = true
       process.executor = 'slurm'
       process.containerOptions = '-B /mnt/gluster:/mnt/gluster'
       aws {
           accessKey = 'tercesrepus'
           secretKey = 'supersecret'
           client.endpoint = 'https://s3.myendpoint.ac.uk'
           protocol = 'HTTPS'
       }
  } 
}

It is an S3-compatible backend in an OpenStack private cloud. I can use mybucket regularly with Python boto3 (where my client would look something like this):

import boto3

session = boto3.session.Session()
s3_client = session.client(
     service_name='s3',
     aws_access_key_id='tercesrepus',
     aws_secret_access_key='supersecret',
     endpoint_url='https://s3.myendpoint.ac.uk',
)
credentials redacted
;)

But my Nextflow workflow fails; the log shows:

...
Feb-14 10:59:44.334 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 70; id: 3; name: openmsFileInfo (3); status: COMPLETED; exit: 0; error: -; workDir: /mnt/gluster/nf/work/bd/6c8b64d97fbb911ec5b4ecb2e87eb5 started: 1550141984309; exited: 2019-02-14T10:59:44.179005Z; ]
Feb-14 10:59:44.472 [Task monitor] DEBUG nextflow.file.FileHelper - Creating a file system instance for provider: S3FileSystemProvider
Feb-14 10:59:44.486 [Task monitor] DEBUG nextflow.Global - Using AWS credentials defined in nextflow config file
Feb-14 10:59:44.501 [Task monitor] DEBUG nextflow.file.FileHelper - AWS S3 config details: {secret_key=supers.., endpoint=https://s3.myendpoint.ac.uk, access_key=terce..}
Feb-14 10:59:46.218 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'openmsFileInfo (3)'

Caused by:
  mybucket.s3.myendpoint.ac.uk

com.amazonaws.SdkClientException: Unable to execute HTTP request: mybucket.s3.myendpoint.ac.uk
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1163)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1109)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:758)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:732)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:714)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:674)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:656)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:520)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4705)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4652)
    ...

Which can't work, since there is no mybucket.s3.myendpoint.ac.uk, only s3.myendpoint.ac.uk.
Does anyone know how I can change the AmazonS3Client?

Mathias Walzer
@mwalzer
Feb 14 11:40 UTC
might be related to #797
though that error is a timeout, and there the S3 access is also via s3file = file('...
Alexey Dushen
@blacky0x0
Feb 14 12:02 UTC
Is there a way to use sub-pipelines / sub-modules / functions in Nextflow now?
Maxime Garcia
@MaxUlysse
Feb 14 12:03 UTC
@blacky0x0 I believe that submodules are still in development, but you can definitely try them out
Alexey Dushen
@blacky0x0
Feb 14 12:06 UTC
aha, I see, thx a lot
Paolo Di Tommaso
@pditommaso
Feb 14 12:13 UTC
@blacky0x0 :point_up: February 11, 2019 3:59 PM
Alexey Dushen
@blacky0x0
Feb 14 12:14 UTC
yep, I saw it, sorry
btw, what happened at the end of 2018? The version became 18.10.x after 0.32.x.
Maxime Garcia
@MaxUlysse
Feb 14 12:15 UTC
it's all in the blog
Look at the New release schema part
Anthony Ferrari
@af8
Feb 14 13:58 UTC

Hi guys, 2 quick questions. Let's suppose the Nextflow session has to handle thousands of jobs:

  1. When a job fails and is resubmitted because maxRetries > 0, is it resubmitted instantaneously or is it queued at the end of the list?

  2. Do we have an idea of how many jobs can be handled in the same Nextflow session? 100,000, 1 million, no limit ;-) ?
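
(For context, the retry behaviour in question 1 is the kind enabled by a configuration roughly like this; the values are illustrative:)

// nextflow.config -- sketch only
process {
    errorStrategy = 'retry'
    maxRetries    = 3        // a failed task is resubmitted up to 3 times before giving up
}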

Maxime Garcia
@MaxUlysse
Feb 14 13:59 UTC
1/ no idea
2/ I would say no limit ;-) NF doesn't have to compute the DAG prior to execution
I did manage to make an endless loop
Anthony Ferrari
@af8
Feb 14 14:07 UTC
thanks Max
Maxime Garcia
@MaxUlysse
Feb 14 14:10 UTC
np
Paolo Di Tommaso
@pditommaso
Feb 14 14:34 UTC
there isn't an intrinsic limit
Maxime Garcia
@MaxUlysse
Feb 14 14:36 UTC
I'm guessing it'll break at some point if too much space is occupied
Paolo Di Tommaso
@pditommaso
Feb 14 14:36 UTC
I guess the same :)
well, the task ID is an integer, therefore up to 2 billion is OK, which should hopefully be enough
Maxime Garcia
@MaxUlysse
Feb 14 14:37 UTC
I'm glad to see I haven't said that many wrong things today
Paolo Di Tommaso
@pditommaso
Feb 14 14:37 UTC
ahah
Anthony Ferrari
@af8
Feb 14 14:46 UTC
Do you have an idea about the first question, @pditommaso?
Philip Jonsson
@kpjonsson
Feb 14 15:46 UTC
@pditommaso From nextflow-io/nextflow#124, I gather that Nextflow always presumes the memory base unit for LSF is MB. Is that true? If so, a parameter to change that would be good. Our LSF configuration uses GB as the base unit for bsub commands.
Stephen Kelly
@stevekm
Feb 14 17:13 UTC

@rfenouil in regards to having optional input files, I actually just finished figuring out how to handle a similar situation myself and posted it here: https://github.com/stevekm/nextflow-demos/tree/master/variable-input-files

In my case I have a list of sample IDs and a variable number of files that may have been generated for each sample, which I wanted to collect and then pass on to a final per-sample reporting process. It uses the "dummy file" idea also. Maybe it could help.

I think having "optional" input items for a process might be a bad idea because it breaks the cadinality of the input channels, etc. Not sure. I've found that instead of trying to embed such logic into the process, it is better to instead implement the logic in the Channels such that the items that get passed to the process have a consistent cardinality.
and it appears that you can pass an empty list as an input like this:
input:
file(my_items: "*") from Channel.from([])
or something to that effect; an example is shown in the GitHub link there
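Along the same channel-level lines, a minimal sketch of keeping the cardinality consistent with the ifEmpty operator (the names and the placeholder path are illustrative, not from the original posts):

ch_optional = params.optionalFile ? Channel.fromPath(params.optionalFile) : Channel.empty()

// inject a placeholder item when nothing was provided, so the process still receives exactly one file
ch_optional.ifEmpty( file('NO_FILE') ).set { ch_optionalFile_forDE }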
Stephen Kelly
@stevekm
Feb 14 17:24 UTC

I have a question: I want to have a mapping of output file suffixes to use throughout the pipeline; however, I would prefer to generate the map within nextflow.config under the process config. Maybe something like this:

// nextflow.config

params.suffix_map = [:]

process {
    withName: foo {
        params.suffix_map['foo'] = 'foo.txt'
    }
    withName: bar {
        params.suffix_map['bar'] = 'bar.txt'
    }
}

then in my Nextflow script I can have something like

process foo {

input:
set val(SampleID), file(input_file) from some_channel

output:
file("${output_file}")

script:
output_file = "${SampleID}.${params.suffix_map['foo']}"
"""
echo "${SampleID}" > "${output_file}"
"""
}

It seems that embedding functions in the nextflow.config is a bad idea, and the config complains if there are unknown directives and such. So my question is, would it be a bad idea to try to use this kind of method to set the per-process suffix values in a map in the config?
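
For what it's worth, I think the map itself could also be declared directly under params in the config, outside the process scopes; a minimal sketch using the same keys as the example above:

// nextflow.config -- sketch only
params.suffix_map = [
    foo: 'foo.txt',
    bar: 'bar.txt'
]

// then in the script, as above: output_file = "${SampleID}.${params.suffix_map['foo']}"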

Anthony Underwood
@aunderwo
Feb 14 18:44 UTC

Hi. Please can I ask for a quick clarification on some unexpected behaviour?

I want to sort files before processing them, so I have an operator as follows:

scaffolds_for_combined_analysis.toSortedList( { a, b -> a.getBaseName() <=> b.getBaseName()} )

The issue I'm finding is that this emits an empty list and causes the process to start even when no items enter the scaffolds_for_combined_analysis channel

scaffolds_for_combined_analysis.view()

returns nothing

scaffolds_for_combined_analysis.toSortedList( { a, b -> a.getBaseName() <=> b.getBaseName()} ).view()

returns []
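The same behaviour can be reproduced in isolation, roughly:

Channel.empty()
       .toSortedList( { a, b -> a.getBaseName() <=> b.getBaseName() } )
       .view()   // prints: []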

hydriniumh2
@hydriniumh2
Feb 14 19:10 UTC
@pditommaso I'm not sure about multipart uploads; according to the AWS documentation they can't verify downloads in most situations. https://docs.aws.amazon.com/cli/latest/topic/s3-faq.html#download
Anthony Underwood
@aunderwo
Feb 14 20:47 UTC
I suppose this is because an empty channel will be converted to an empty list. Would it be best to check whether the list is empty and only run the process if it's not?
Anthony Underwood
@aunderwo
Feb 14 21:40 UTC
However, I find that even though the docs say toSortedList converts to a list, it reports that it's a DataflowVariable
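If I understand it right, the DataflowVariable is the type of the channel itself; toSortedList emits a single item (the sorted list), so downstream operators still receive the list. A rough sketch (channel name reused from the message above):

sorted_ch = scaffolds_for_combined_analysis.toSortedList( { a, b -> a.getBaseName() <=> b.getBaseName() } )

sorted_ch.subscribe { list ->
    println "sorted list with ${list.size()} item(s): ${list}"   // 'list' is the full sorted collection
}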
Stephen Kelly
@stevekm
Feb 14 22:35 UTC
is there a way to tell Nextflow to only run my pipeline until a certain process finishes all tasks?