These are chat archives for nextflow-io/nextflow

8th May 2019
Rad Suchecki
@rsuchecki
May 08 00:10

It looks good to me. Clear and concise; I would not describe it as cumbersome. Think of the person who will need to maintain your code; it might be you.

:100:

Austin Keller
@austinkeller
May 08 00:32
Is it possible to pass environment variables to a specific container? I have a different Docker container for each process but don't see an obvious way to set environment variables for each one. The docker.runOptions and docker.envWhitelist settings seem to be global. When I try to set them under process task_name {} I get the error ERROR ~ No such variable: docker
Austin Keller
@austinkeller
May 08 00:44
Is it possible to set the docker or env configuration scopes per-process, or are they always global?
Austin Keller
@austinkeller
May 08 04:33
Never mind! I stumbled across https://www.nextflow.io/docs/latest/process.html#containeroptions and that's exactly what I needed!
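
For anyone else landing here: containerOptions is a per-process directive, so with the local Docker executor a sketch like the following should pass an environment variable to just one process's container (the image name and variable are made up for illustration):

process task_name {
    container 'my/image:latest'                 // placeholder image
    // pass an environment variable to this process's container only
    containerOptions '-e MY_VAR=some_value'

    script:
    """
    echo \$MY_VAR
    """
}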
Austin Keller
@austinkeller
May 08 05:12
Oh no. It looks like that option isn't supported by AWS Batch (or Kubernetes). I'm still looking for an answer...
Paolo Di Tommaso
@pditommaso
May 08 08:19
Carl Witt
@carlwitt
May 08 09:23
I suppose it is not a good idea to run the same workflow (with different parameters) from the same directory? Because for instance two nextflow instances would try to write to .nextflow.log?
micans
@micans
May 08 09:31
@carlwitt I believe that's correct, certainly for two pipelines running concurrently. I am curious what happens when you run them consecutively; and whether NF caching could still work if you use -resume.
Carl Witt
@carlwitt
May 08 09:36
Ah right, I was aiming to run them concurrently. When you run them multiple times consecutively, Nextflow appends a digit to the old log file; for instance, I have .nextflow.log.1, .nextflow.log.2, etc. in my workflow directory.
micans
@micans
May 08 09:37
Yes, sorry, running consecutively is fine, concurrently not.
Jason Steen
@jasteen
May 08 10:38

if I have a process input that looks like set baseName, file(vcf), file(tbi) from ch_indexedVCF, how do I collect all the VCF files into a list for use in a later command? I tried

echo ${vcf.join("\n")} > temp.list
bcftools merge -O z -o "merged.vardict.vcf.gz" -l temp.list

but it doesn't work. And if I ch_indexedVCF.collect(), then I can't work out the logic to only write the VCF files into the .list file. The biggest problem I have is that bcftools needs to be able to see the index files when it runs, so I can't just separate the two components and read them in separately, as they seem to end up in different work directories.

Evan Floden
@evanfloden
May 08 10:55
If I understand correctly, you can generate the list file as follows using Nextflow operators:
ch_indexedVCF = Channel.from ( ['basenameA', file('A.vcf'), file('A.tbi')],
                               ['basenameB', file('B.vcf'), file('B.tbi')] )

ch_indexedVCF.map { it -> it[1].name }
             .collectFile(name: 'list.txt', newLine: true)
As you also need to include the index files themselves, you need another channel with a .collect() to ensure all the input files are staged in the workdir.
Jason Steen
@jasteen
May 08 11:04
I tried something similar where I just mapped the VCF and the index to separate channels and set them both as input, but bcftools couldn't see the indexes. I'll try your suggestion.
I have a lot of trouble with "collect" since most of the examples assume you have a channel of single items, and I always seem to have a channel of lists.
Evan Floden
@evanfloden
May 08 11:28
Map then collect is a useful way to go about it.
Anthony Underwood
@aunderwo
May 08 11:28

@eugene.bragin_gitlab I did this just yesterday with

  memory { 2.GB * task.attempt }

  input:
  ....


  output:
  ....

  script:
  spades_memory = 2 * task.attempt

  """
  spades.py --pe1-1 ${file_triplet[1]} --pe1-2 ${file_triplet[2]} --pe1-m ${file_triplet[0]} --only-assembler  -o . --tmp-dir /tmp/${pair_id}_assembly -k ${kmers} --threads 1 --memory ${spades_memory}

  """

seemed to do the trick
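
For completeness, a minimal self-contained version of that retry-with-more-memory pattern might look like this (the process name, input channel and read layout are hypothetical; errorStrategy 'retry' and maxRetries are standard directives):

process assemble {
    // ask the scheduler for more memory on each automatic retry
    memory { 2.GB * task.attempt }
    errorStrategy 'retry'
    maxRetries 3

    input:
    set val(pair_id), file(reads) from read_pairs_ch   // hypothetical channel of paired reads

    output:
    file('scaffolds.fasta') into assemblies_ch

    script:
    spades_memory = 2 * task.attempt    // in GB, also passed to the tool itself
    """
    spades.py -1 ${reads[0]} -2 ${reads[1]} --only-assembler -o . --threads ${task.cpus} --memory ${spades_memory}
    """
}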

Evan Floden
@evanfloden
May 08 11:28
@jasteen
Channel.from ( ['basenameA', file('A.vcf'), file('A.tbi')],
               ['basenameB', file('B.vcf'), file('B.tbi')] )
       .into { files_ch; list_ch }

list_ch.map { it -> it[1].name }
       .collectFile(name: 'list.txt', newLine: true)
       .set {list_f}

files_ch
    .map { baseName, vcf, tbi -> [ vcf, tbi ] }   // keep only the files, drop the baseName string
    .collect()
    .set { all_files }

process foo {

    input:
    file list from list_f
    file '*' from all_files

    script:
    """
    cat $list
    """
}
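For Jason's actual use case, the script block in process foo could then run the merge from his earlier message against the generated list (the files staged via all_files sit alongside list.txt in the work directory, so bcftools can see the indexes):

    script:
    """
    bcftools merge -O z -o merged.vardict.vcf.gz -l $list
    """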
Jason Steen
@jasteen
May 08 12:13
@evanfloden, I'm 95% sure that worked for my pipeline. Thanks heaps. I certainly have a merged VCF file with all my samples. Now I just need to work out why VarDict isn't actually putting any variants in there, but that's not a Nextflow problem. Cheers again.
Evan Floden
@evanfloden
May 08 13:38
You are welcome. Wrangling data in channels is the fun part!
Ólafur Haukur Flygenring
@olifly
May 08 15:58

@pditommaso - Hi :)

I've made a couple of issue/new-feature tickets regarding the weblog, since it's vital to the way we're integrating Nextflow into our pipelines: namely nextflow-io/nextflow#1139 and nextflow-io/nextflow#1145.

If these issues/ideas sound reasonable to you then we'd be happy to work on them and submit pull requests with these fixes/functionality, along with fixes for other weblog/TraceObserver issues that are easy to solve/implement alongside them :)

welchwilmerck
@welchwilmerck
May 08 16:57
@pditommaso: We're excited to see DSL2 coming together (modules, in particular) and are wondering about your current estimate for merge into master (that is, when could we use it in "production"). Are there alternatives for decomposing complex workflows into separately maintainable files that you might recommend in the meantime?
Stephen Kelly
@stevekm
May 08 17:31

I suppose it is not a good idea to run the same workflow (with different parameters) from the same directory? Because for instance two nextflow instances would try to write to .nextflow.log?
I believe that's correct, certainly for two pipelines running concurrently. I am curious what happens when you run them consecutively; and whether NF caching could still work if you use -resume

@carlwitt @micans yes that is exactly what happens :( For example, I had run my primary workflow (main.nf) and was using cached results for custom processing, then tried to run my reference-file-download workflow in the same directory, and I could no longer use the cached results from the first workflow... very sad because it took 14 hours to regenerate the results from the first workflow, even though they were all still there in the 'work' directory.
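
A practical takeaway (assuming a standard setup where work/, .nextflow/ and .nextflow.log all live in the launch directory): give each workflow its own launch directory so logs and resume caches never collide, e.g.

mkdir -p runs/main runs/ref_download
( cd runs/main         && nextflow run /path/to/main.nf -resume )    # paths are placeholders
( cd runs/ref_download && nextflow run /path/to/download.nf )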

Stephen Kelly
@stevekm
May 08 17:39

@jasteen @evanfloden you can also flatten nested lists, for example ones produced by 'groupTuple'

// channel for sample reports; gather items per-sample from other processes
sampleIDs.map { sampleID ->
    // dummy file to pass through channel
    def placeholder = file(".placeholder1")

    return([ sampleID, placeholder ])
}
// add items from other channels (unpaired steps)
.concat(sample_signatures_reformated) // [ sampleID, [ sig_file1, sig_file2, ... ] ]
// group all the items by the sampleID, first element in each
.groupTuple() // [ sampleID, [ [ sig_file1, sig_file2, ... ], .placeholder, ... ] ]
// need to flatten any nested lists
.map { sampleID, fileList ->
    def newFileList = fileList.flatten()

    return([ sampleID, newFileList ])
}
.into { sample_output_files; sample_output_files2 }

I end up using this to collect bunches of files per-sample that may or may not exist, for example if one sample's data did not have enough entries for a certain tool to run, or the presence of certain control samples might have triggered extra analysis steps. But then the code in your 'script' section has to be able to conditionally react to the presence/absence of input files, which gets complicated and ugly fast.
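
One way to keep that conditional logic from sprawling is to build the optional parts of the command in Groovy before the command string, filtering out the placeholder; the process name, report command and flag below are hypothetical:

process sample_report {
    input:
    set val(sampleID), file(sample_files) from sample_output_files

    script:
    // keep only real inputs; '.placeholder1' is the dummy passed through the channel above
    def sig_files = sample_files.findAll { it.name != '.placeholder1' }
    def sig_args  = sig_files ? "--signatures ${sig_files.join(' ')}" : ''
    """
    generate_report.sh --sample ${sampleID} ${sig_args}
    """
}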

najitaleb
@najitaleb
May 08 17:44
Hi. Say I have an RStudio Docker container where I run a few scripts, each with their own inputs and outputs. How can I get Nextflow to accommodate this configuration and execute the scripts inside of the container?
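
One common pattern (with docker.enabled = true in nextflow.config) is to give each R script its own process, point the container directive at the R image, and call Rscript in the script block; Nextflow mounts the task work directory into the container, so declared outputs end up on the host file system. The image name, script path and channels below are placeholders:

process run_r_analysis {
    container 'rocker/rstudio:latest'     // any image with R and the required packages

    input:
    file(input_csv) from input_ch         // hypothetical input channel

    output:
    file('results.rds') into results_ch

    script:
    """
    Rscript /opt/scripts/analysis.R ${input_csv} results.rds
    """
}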
najitaleb
@najitaleb
May 08 17:51
Thanks a lot
najitaleb
@najitaleb
May 08 17:56
So how would I check the script output inside of the container after I run it?
Actually, I think I understand. All the results will appear in the host file system.
Austin Keller
@austinkeller
May 08 20:16
@pditommaso I've been able to follow that pattern for specifying different containers for each process, but then if I want to specify different environment variables for each container that doesn't seem to be possible. It's a common Docker usage pattern to pass runtime information in environment variables using the -e flag. AWS Batch also supports passing environment variables via the job definition. But there doesn't seem to be a nice way to specify this in the nextflow definition. The best workaround I've found is to run the nextflow pipeline, have it fail, then go into AWS Batch and update the job definition for each process manually with my environment variables. I'm hoping there's a better way...
Paolo Di Tommaso
@pditommaso
@welchwilmerck modules have already been merged into master
Austin Keller
@austinkeller
May 08 20:24
@pditommaso ah maybe that will do it! I'll give that a shot. Thanks!
welchwilmerck
@welchwilmerck
May 08 20:26
Thank you!
Paolo Di Tommaso
@pditommaso
May 08 20:28
:+1:
Austin Keller
@austinkeller
May 08 21:28
@pditommaso That seems to be working for passing environment variables through to Docker, but I'm running into an idiosyncrasy when the environment variables contain a space.
Paolo Di Tommaso
@pditommaso
May 08 21:28
Looks strange, open an issue on GitHub
Austin Keller
@austinkeller
May 08 21:29
@pditommaso Will do, thanks again!
Austin Keller
@austinkeller
May 08 21:37