These are chat archives for nextflow-io/nextflow

28th
Mar 2019
Rad Suchecki
@rsuchecki
Mar 28 00:21
@lebernstein Assuming the original input file is not something generated by the pipeline:
file1 = file('input.txt')

process foo {
  echo true
  input:
    file(file2) from Channel.fromPath(file1.text.trim())

  """
  echo $file2
  """
}
Tobias "Tobi" Schraink
@tobsecret
Mar 28 01:36
@lebernstein does file fileobj from file1.splitCsv().map { $it -> file($it) } do the trick?
Or set val filename, file fileobj from file1.splitCsv().map { $it -> [file($it), $it] }
Laurence E. Bernstein
@lebernstein
Mar 28 07:22
@rsuchecki File1 is a channel with a file name in it: I found the name using a Python script in another process and wrote the name to a file, since I can't save it to a value.
@rsuchecki, @tobsecret None of those solutions seem to work; they all give syntax errors of some sort.
Although I see what @tobsecret is trying to do and it kinda makes sense.
Rad Suchecki
@rsuchecki
Mar 28 09:15

If the file name is identified in another process then it has to be either

  • a full path, or
  • staged out from that process

otherwise the downstream process has no way of finding it.
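
For example, a minimal sketch of the second option (process names and script bodies are purely illustrative):

process find_file {
  output:
  file 'result.txt' into result_ch    // declaring the file as output stages it out

  """
  echo hello > result.txt
  """
}

process use_file {
  input:
  file f from result_ch

  """
  cat $f
  """
}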

Carl Witt
@carlwitt
Mar 28 09:20
I put together a construct like this yesterday: it downloads files from an FTP address to the workflow directory (if not already present) and then outputs maps that contain file() entries pointing to the local files. This works for me, and you should be able to remove the download logic and directly output file() objects to a channel, causing them to be properly staged.
/*
benchmark-configurations.csv:

run_accession,input_ftp_1,input_ftp_2,initial_memory_limit_gb
SRR1930149,ftp://ftp.sra.ebi.ac.uk/.../SRR1930149_1.fastq.gz,ftp://ftp.sra.ebi.ac.uk/.../SRR1930149_2.fastq.gz,20,
*/

configurations = Channel
    .from( file('benchmark-configurations.csv').text )
    .splitCsv(header: true)

process download_if_missing {

    input:
    val experiment from configurations

    output:
    val experiment into local_configurations

    exec:
    // for each input file
    for(index in ['input_ftp_1', 'input_ftp_2']) {

        // create local counterparts for files from their remote (ftp) urls
        ftp = file(experiment[index])
        local = file("$params.input_dir/${ftp.getName()}")    

        if (! local.exists()) {
            println "Download $ftp"
            ftp.copyTo(local)
            println "Done, downloaded ${local.size()} bytes"
        } else {
            println "File ${ftp.getSimpleName()} already present."
        }
        experiment[index] = local
    }
}
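
For instance, the simplified variant (no download step) might look like this; it's a sketch, with column names assumed to match the CSV header above, and the file() objects emitted by the channel get staged by the downstream process:

Channel
    .fromPath('benchmark-configurations.csv')
    .splitCsv(header: true)
    .map { row -> [ row.run_accession, file(row.input_ftp_1), file(row.input_ftp_2) ] }
    .set { staged_inputs }

process use_inputs {
    input:
    set val(accession), file(read1), file(read2) from staged_inputs

    """
    echo $accession $read1 $read2
    """
}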
Jonathan Manning
@pinin4fjords
Mar 28 09:28
Morning! Quick one- can I synchronise channels by key(s), without doing a join?
Ola Tarkowska
@olatarkowska
Mar 28 09:43
Morning, could I ask how Nextflow deals with a KUBECONFIG env var that points to multiple clusters? I am getting ERROR ~ Unable to lookup Kubernetes cluster configuration
Ola Tarkowska
@olatarkowska
Mar 28 10:12
I filed a bug: nextflow-io/nextflow#1089
please let me know if you need more info
Vladimir Kiselev
@wikiselev
Mar 28 10:16
how can I download a directory/whole bucket from S3 using NF?
Vladimir Kiselev
@wikiselev
Mar 28 10:52
also, is it possible to get a file from our custom S3 storage which does not use s3://, e.g. Channel.fromPath('https://scrnaseq-course.cog.sanger.ac.uk/data/EXAMPLE.cram')
if not, will I need to curl or wget it manually?
Rad Suchecki
@rsuchecki
Mar 28 10:54
Channel.fromPath with a URL should work
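e.g., using the URL from the question above:

Channel
    .fromPath('https://scrnaseq-course.cog.sanger.ac.uk/data/EXAMPLE.cram')
    .set { ch_cram }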
Vladimir Kiselev
@wikiselev
Mar 28 10:59
thanks @rsuchecki, looks like I was checking the wrong process...
it works
do you know if it is possible to download a directory from S3, or a whole bucket? downloading one file at a time is not nice
Paolo Di Tommaso
@pditommaso
Mar 28 11:06
I think if you use s3://bucket/* it should work
Vladimir Kiselev
@wikiselev
Mar 28 11:16
is s3:// required there? I am getting this: WARN: Unable to stage foreign file: https://scrnaseq-course.cog.sanger.ac.uk/* (try 1) -- Cause: Unable to access path: /*
Rad Suchecki
@rsuchecki
Mar 28 11:18
I only used NF + S3 with AWS Batch, and there it was seamless with s3://, but I think that relied on the presence of aws-cli in the AMI
Paolo Di Tommaso
@pditommaso
Mar 28 11:41
show the code snippet please
Vladimir Kiselev
@wikiselev
Mar 28 11:59
Channel
    .fromPath('course_files', type: 'dir')
    .into { ch_course_files1; ch_course_files2 }

Channel
    .fromPath('https://scrnaseq-course.cog.sanger.ac.uk/*')
    .into { ch_data1; ch_data2 }

process html {
  input: 
    file fs from ch_course_files1
    file dat from ch_data1
  script:
  """
  cp -r course_files/* .
  Rscript -e "bookdown::render_book('index.html', 'bookdown::gitbook')"
  """
}
Paolo Di Tommaso
@pditommaso
Mar 28 12:30
s3 bucket or files over http?
s3:// allows glob patterns, whereas the http:// protocol does not
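e.g., a sketch with a hypothetical bucket name, assuming AWS credentials are configured:

Channel
    .fromPath('s3://my-bucket/data/*.cram')   // glob expansion works for the s3:// scheme
    .into { ch_data1; ch_data2 }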
micans
@micans
Mar 28 14:01

(Vlad will be back, in a meeting). Another question: we got this error in a pipeline that ran for half an hour, then produced this:

Mar-28 13:22:44.756 [Task monitor] ERROR nextflow.processor.TaskProcessor - Execution aborted due to an unexpected error java.lang.IllegalArgumentException: Cannot compare java.lang.Integer with value '1' and groovy.util.ConfigObject with value '[:]' at
org.codehaus.groovy.runtime.typehandling.DefaultTypeTransformation.compareToWithEqualityCheck(DefaultTypeTransformation.java:606) at 
org.codehaus.groovy.runtime.typehandling.DefaultTypeTransformation.compareTo(DefaultTypeTransformation.java:540) at 
org.codehaus.groovy.runtime.ScriptBytecodeAdapter.compareTo(ScriptBytecodeAdapter.java:714) at 
org.codehaus.groovy.runtime.ScriptBytecodeAdapter.compareLessThanEqual(ScriptBytecodeAdapter.java:749)

Could this be an error in a closure in the config, that only gets evaluated on a retry? Or is it something else entirely?

Paolo Di Tommaso
@pditommaso
Mar 28 14:07
some mess between a dyn rule and the config file: Cannot compare java.lang.Integer with value '1' and groovy.util.ConfigObject with value '[:]'
micans
@micans
Mar 28 14:08
yep. haven't found it yet, but will look harder
Tobias "Tobi" Schraink
@tobsecret
Mar 28 14:25
@lebernstein Ooof, sorry it didn't work. I was just riffing - would be great to have an MCVE so I could test more quickly.
micans
@micans
Mar 28 15:06
@pditommaso We found that maxRetries was not set in our config process scope, and setting it seems to have fixed the problem (the dyn rules access process.maxRetries). It's slightly surprising; the documentation suggests there is a default value of 1. Is there a subtlety there?
Paolo Di Tommaso
@pditommaso
Mar 28 15:06
code first, everything else after :D
micans
@micans
Mar 28 15:11
Do you mean show me the code, or do you mean there is an order to things and setting maxRetries = <num> is required? :-P
Paolo Di Tommaso
@pditommaso
Mar 28 15:11
yes, sorry .. :)
micans
@micans
Mar 28 15:12
Which one?
Paolo Di Tommaso
@pditommaso
Mar 28 15:12
the first .. (I read too fast sorry .. )
micans
@micans
Mar 28 15:13
Alright ... I'll put it on the list. If the pipeline works other parts of the list will magically appear sadly :-(
Paolo Di Tommaso
@pditommaso
Mar 28 15:14
I see
the dyn rules access process.maxRetries
I'm not understanding this
micans
@micans
Mar 28 15:15
ok, I'll summarise
one sec, hold on
micans
@micans
Mar 28 15:20

simplified, it seems this happened:

process {
  cpus   =  1
  memory =  8.GB
  errorStrategy = 'ignore'

  withName: py_scanorama {
    errorStrategy = { task.exitStatus == 130 && task.attempt <= process.maxRetries ? 'retry' : 'ignore' }
    memory = { 12.GB + 8.GB * (task.attempt - 1) }
  }
}

This led to the failure quoted above. Inserting maxRetries = 2 below errorStrategy seems to have helped. That would make sense, as task.attempt was bound to be 1, and it then complained about comparing the integer 1 with a groovy.util.ConfigObject with value '[:]'

Paolo Di Tommaso
@pditommaso
Mar 28 15:21
process.maxRetries should be task.maxRetries
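i.e. the rule becomes something like this (a sketch of the corrected config, with maxRetries = 2 as in the fix above):

withName: py_scanorama {
  maxRetries    = 2
  errorStrategy = { task.exitStatus == 130 && task.attempt <= task.maxRetries ? 'retry' : 'ignore' }
  memory        = { 12.GB + 8.GB * (task.attempt - 1) }
}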
micans
@micans
Mar 28 15:21
hehe, thanks.
I try to avoid self-harm now
Paolo Di Tommaso
@pditommaso
Mar 28 15:21
:D
micans
@micans
Mar 28 15:22
:beers:
Paolo Di Tommaso
@pditommaso
Mar 28 15:22
sure thing !
micans
@micans
Mar 28 15:22
:+1:!

I'm still flailing though ... the documentation says (https://www.nextflow.io/docs/latest/process.html#maxretries):

process retryIfFail {
    errorStrategy 'retry'
    maxRetries 3

    """
    echo 'do this as that .. '
    """
}

Also our rnaseq pipeline has process.maxRetries -- it works that way, but is it supposed to be task.maxRetries?

Paolo Di Tommaso
@pditommaso
Mar 28 15:28
the actual values of process directives are accessible as task.directiveName
there's no such process.maxRetries attribute only task.maxRetries
self-harming ?
Michael Chimenti
@mchimenti
Mar 28 15:52
A little OT, but what is the reason to prefer Gitter to Slack?
Paolo Di Tommaso
@pditommaso
Mar 28 15:53
less frills
Michael Chimenti
@mchimenti
Mar 28 15:53
Just integration with github and markdown?
Paolo Di Tommaso
@pditommaso
Mar 28 15:53
exactly
Michael Chimenti
@mchimenti
Mar 28 15:53
ok, thanks. I’m creating a comp-bio gitter community for my university
btw, Paolo, I really love NF. I went from zero to working ATAC-seq pipeline in just a few days
micans
@micans
Mar 28 15:57
Thanks @pditommaso . Self-harm was not very appropriate language, I meant something like banging forehead.
Paolo Di Tommaso
@pditommaso
Mar 28 15:58
I went from zero to working ATAC-seq pipeline in just a few days
The goal is a few mins ;)
Sinisa Ivkovic
@sivkovic
Mar 28 15:58
Hi, we are implementing the pipeline in NF, and as we added more tools we started getting this issue when running it on AWS Batch: https://groups.google.com/forum/#!topic/nextflow/BXA-esz1eJ4. It looks to me like there is still no way to increase these timeouts on the AWS side. I also tried reconfiguring aws-cli to limit max_concurrent_requests but it didn't help. I was wondering whether you have any ideas how to solve this? Thanks
Michael Chimenti
@mchimenti
Mar 28 15:58
haha, well, an expert could I’m sure… I’m a novice :)
my goal is to banish bcbio from our lab and replace with all NF pipelines
Paolo Di Tommaso
@pditommaso
Mar 28 16:01
we started getting this issue when running it on AWS Batch
Michael Chimenti
@mchimenti
Mar 28 16:01
not that there is anything wrong with bcbio, it’s just not very efficient
Paolo Di Tommaso
@pditommaso
Mar 28 16:01
I would need a more detailed report
not that there is anything wrong with bcbio, it’s just not very efficient
micans
@micans
Mar 28 16:02
(is there a task config scope?)
Paolo Di Tommaso
@pditommaso
Mar 28 16:02
well, it uses a different approach
(is there a task config scope?)
nope, nearly all process directives can be accessed as task.xxx
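for example, a minimal sketch:

process example {
    cpus 2
    memory 4.GB

    """
    echo "running with ${task.cpus} cpus and ${task.memory} of memory"
    """
}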
micans
@micans
Mar 28 16:03
got it, thanks!
Paolo Di Tommaso
@pditommaso
Mar 28 16:03
:v:
micans
@micans
Mar 28 16:04
although a tiny voice in me says that this is confusing :-)
Michael Chimenti
@mchimenti
Mar 28 16:04
will anyone from the NF core group be attending GLBIO2019?
Sinisa Ivkovic
@sivkovic
Mar 28 16:05

I would need a more detailed report

What information do you need?

Paolo Di Tommaso
@pditommaso
Mar 28 16:12
a consistent description of the problem, so that I can try to put pressure on the AWS side
will anyone from the NF core group be attending GLBIO2019?
nope, we will be at BOSC in Basel
and NF Camp in September .. ;)
Chelsea Sawyer
@csawye01
Mar 28 17:24
Is it possible to have a value from a channel in the input, output and publishDir within the same process? I am getting a No such variable: varName error when it is declared in all three.
Tobias "Tobi" Schraink
@tobsecret
Mar 28 17:31
@csawye01 Should be... I usually do that when I have some sort of id associated with some data files. What's the code that causes this error?
Chelsea Sawyer
@csawye01
Mar 28 17:32
@tobsecret
fqname_fqfile_ch = fastqs_fqc_ch.map { fqFile -> [fqFile.getParent().getName(), fqFile ] }
process fastqc {
    tag "$name"
    module MODULE_FASTQC_DEFAULT
    publishDir path: "${outputDir}/${projectName}/FastQC", mode: 'copy'

    input:
    set val(projectName), file(fqFile) from fqname_fqfile_ch

    output:
    set val(projectName), file "*_fastqc" into fqc_folder_ch
    file "*.html" into fqc_html_ch

    script:
    """
    fastqc --extract ${fqFile}
    """
}
Tobias "Tobi" Schraink
@tobsecret
Mar 28 17:34
Does it work if you put parentheses for the file output?
set val(projectName), file("*_fastqc") into fqc_folder_ch
micans
@micans
Mar 28 17:35
good call ... also, your tag "$name" does not seem to match anything.
but the missing parentheses look ominous
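i.e. something like this (a sketch combining both fixes, with projectName used for the tag; the module directive from your snippet is omitted here):

process fastqc {
    tag "$projectName"
    publishDir path: "${outputDir}/${projectName}/FastQC", mode: 'copy'

    input:
    set val(projectName), file(fqFile) from fqname_fqfile_ch

    output:
    set val(projectName), file("*_fastqc") into fqc_folder_ch
    file "*.html" into fqc_html_ch

    """
    fastqc --extract ${fqFile}
    """
}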
Tobias "Tobi" Schraink
@tobsecret
Mar 28 17:37
Yeah, with the set operator you need parentheses afaik?
micans
@micans
Mar 28 17:37
think so too
Tobias "Tobi" Schraink
@tobsecret
Mar 28 17:42
@csawye01
This is an excerpt from one of my scripts that works:
process download_reads {

  tag "$accession"

  input:
  val accession from accessions

  output:
  set accession, file('**.fastq.gz') into downloaded_reads

  script:
  """
  some script
  """
}
Laurence E. Bernstein
@lebernstein
Mar 28 18:03
@tobsecret Here's what I want to do (in the simplest form)
process process_1 {

  input:

  output:
    file "name.txt" into name_ch

  script:
  """
    #!/usr/bin/python
    with open("name.txt", 'w') as file1:
      file1.write("file2.txt")
  """
}

// This is what I want to do:
process process_file2 {

  input:
    file file2 from name_ch.splitCsv().map { $it -> file($it) }

  output:

  script:
  """
  """
}
Tobias "Tobi" Schraink
@tobsecret
Mar 28 18:19
Hmmm....
tough cookeh :sweat_smile:
Laurence E. Bernstein
@lebernstein
Mar 28 18:22
It seems like it should be so easy. The basic premise is that I have a filename listed in an XML file (not that it matters) and I want to extract that name and use that file in a process. Would it help if I switched to using Groovy (exec) for the first process? I really could just parse the second file in python also, but I was trying to be cool. :)
micans
@micans
Mar 28 18:41
@lebernstein for what it's worth, this works for me:
process process_1 {
  output: file "name.txt" into name_ch

  script:
  """
    #!/usr/bin/python
    with open("name.txt", 'w') as file1:
      file1.write("/full/path/to/file2.txt")
  """
}

process process_file2 {
  input: file(file2) from name_ch.map { file(it.text) }

  script:
  """
  echo I have file $file2
  """
}
Laurence E. Bernstein
@lebernstein
Mar 28 18:44
@micans Nice. Thanks. @rsuchecki was actually very close.
tbugfinder
@tbugfinder
Mar 28 19:52
@sivkovic Which NF version are you using, and which instance type, AMI, and ECS agent version? Did you investigate the CloudWatch logs?
Tobias "Tobi" Schraink
@tobsecret
Mar 28 20:59
@micans do you know why this works :sweat_smile: ?
Is the it.text required to signal to Java that file will always get a string?
Ooooooh, I forgot the output channel is not a value but a file :facepalm:
Sinisa Ivkovic
@sivkovic
Mar 28 21:27
Hey @tbugfinder thanks. I'm using nextflow 19.01. Initially I built the compute environment with these four instance types: m4.large, m4.xlarge, m4.2xlarge, m4.4xlarge, and the default ECS AMI, just with a bigger volume attached. Most of the processes in the pipeline require 8 cores and 32 GB of RAM, and that worked fine. After that I tested the pipeline with a compute environment where I set the instance types to 'optimal'. This meant that in some cases more jobs were scheduled onto one instance. With this change, two errors started happening. The first is CannotInspectContainerError: Could not transition to inspecting; timed out after waiting 30s, which just caused some jobs on AWS Batch to end up in state FAILED although the .exitcode was 0; it didn't affect the nextflow run. The second error, DockerTimeoutError: Could not transition to created; timed out after waiting 4m0s, actually caused an error and stopped the nextflow process. The first error happens at least once in every run, and the second maybe in 1 of 10 runs in this case.
One of the concerns we had with AWS Batch was how to decide the size of the volumes we want to attach to the instances. After talking with the AWS guys, they told us to take a look at the EBS autoscaling script they built for Cromwell. To be able to use this script I had to make some changes to the AMI: switching from devicemapper to overlay2 as the Docker storage driver and moving the Docker data-root to a different partition. After this change the second error started happening more often, basically 9 out of 10 times. I tried with both gp2 and io1 volumes but I didn't see any major difference. I also tried reconfiguring aws-cli by limiting max_concurrent_requests to 1, but it also didn't help. What I actually wanted to do first is increase these timeout limits on the AWS Batch side, but I still haven't found a way to do that. I haven't found any logs on CloudWatch since the containers are not started. Also, one of the things which I think could maybe help, but requires a change in nextflow, is using a mounted directory as the working dir instead of downloading files to container space. I'll try to make a test which can reproduce this issue, as @pditommaso asked. If you have any ideas what I could try in the meantime, please let me know. Thanks
tbugfinder
@tbugfinder
Mar 28 22:02
@sivkovic As you changed your AMI, you could also make use of EFS as shared storage. I meant the CloudWatch logs of the instances.
Rad Suchecki
@rsuchecki
Mar 28 22:11
@tobsecret it.text is simply the file's content
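i.e. in the working example above, name_ch emits the staged name.txt file; it.text reads that file's content (the path string) and file(...) turns it into a path object:

// a minimal restatement of the map step from the snippet above
name_ch.map { it -> file(it.text) }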
Tobias "Tobi" Schraink
@tobsecret
Mar 28 22:24
oh that makes tons of sense!