These are chat archives for nextflow-io/nextflow

10th
Jan 2019
Stephen Kelly
@stevekm
Jan 10 01:02
  • which is the most stable
Anthony Underwood
@aunderwo
Jan 10 09:43
@stevekm I have found AWS Batch support to be stable, with a couple of minor glitches but generally excellent. The queue is set up and costs $0 until jobs are submitted, machines fire up (up to the vCPU max limit specified for the compute env) and process jobs, and at the end all the machines automatically terminate. Worry free!
I haven't tried Google compute so can't comment on that
Anand Mayakonda
@PoisonAlien
Jan 10 11:14

Hi, can anyone help me solve this? I have a process which creates a json file upon finishing.

process gemBS_map{

  memory '80 GB'
  executor 'local'

  publishDir file("${results_dir}/04_mapping/")

  input:
    set sample_name, json from sample_jsons

  output:
    file "${sample_name}.json" into bam_ch

  """
  gemBS -j ${json} map
  """
}

However, Nextflow complains about a missing output file.

Missing output file(s) `r.json` expected by process `gemBS_map (1)`

where r in r.json is the sample_name; the file is created successfully, yet I still get the error.

Luca Cozzuto
@lucacozzuto
Jan 10 11:17
what is ${json}? Is it another json file?
Anand Mayakonda
@PoisonAlien
Jan 10 11:19
Nope, it's the input json. The process creates another json file.
Luca Cozzuto
@lucacozzuto
Jan 10 11:20
but with another name? (I'm just wondering if there is some collision)
Anand Mayakonda
@PoisonAlien
Jan 10 11:22
Since the process creates it in another directory (04_mapping), shouldn't it be different?
Luca Cozzuto
@lucacozzuto
Jan 10 11:23
no, because you link the json into this folder, so if they have the same name you are in trouble
also you are missing file()
input:
    set sample_name, file(json) from sample_jsons
json and ${sample_name}.json should be different
or you can rename the json
input:
    set sample_name, file("input_JSON.json") from sample_jsons

 """
  gemBS -j input_JSON.json map 
  """
but then you need to specify another output
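i.e. something along these lines (untested sketch; it assumes gemBS really does write ${sample_name}.json into the task directory):

  input:
    set sample_name, file("input_JSON.json") from sample_jsons

  output:
    file "${sample_name}.json" into bam_ch   // no longer collides with the staged input json

  """
  gemBS -j input_JSON.json map
  """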
Anand Mayakonda
@PoisonAlien
Jan 10 11:29

Ah, I see. The gemBS command also creates a bam file. I used it as the output file name and I still have this error.

process gemBS_map{

  memory '80 GB'
  executor 'local'

  publishDir file("${results_dir}/04_mapping/")

  input:
    set sample_name, json from sample_jsons

  output:
    file "${sample_name}.bam" into bam_ch

  """
  gemBS -j ${json} map
  """
}

Error:

Caused by:
  Missing output file(s) `r.bam` expected by process `gemBS_map (1)`
Paolo Di Tommaso
@pditommaso
Jan 10 12:22
this means the task is not creating the r.bam file
Martin Proks
@matq007
Jan 10 12:47
Hey, is there a way to use the manifest.version variable inside the configuration? I am trying to do something like this
withName:ericscript { container = "nfcore/rnafusion:ericscript_v${manifest.version}" }
Anand Mayakonda
@PoisonAlien
Jan 10 12:48
It has been created. I can access it. But Nextflow doesn't seem to recognize it. Also, sorry, I just posted this issue on the Google group.
Paolo Di Tommaso
@pditommaso
Jan 10 12:51
Nextflow doesn't seem to recognize it
does it exist in the task work dir?
Anand Mayakonda
@PoisonAlien
Jan 10 12:52
yes, it's present in ${results_dir}/04_mapping/
Paolo Di Tommaso
@pditommaso
Jan 10 12:52
no, the work dir with the hash number printed in the error message
Anand Mayakonda
@PoisonAlien
Jan 10 12:55
No, it's not in the work dir. It's in one of the sub-directories.
Paolo Di Tommaso
@pditommaso
Jan 10 12:55
that's the problem
output:
file 'some/path'
must match the declared path
Anand Mayakonda
@PoisonAlien
Jan 10 12:56
So should I use the complete path? Something like below?
output:
    //set val(sample_name),  file("*.bam") into bam_ch
    //set file("*.json") bam_json_ch
    file "${results_dir}/04_mapping/${sample_name}.bam" into bam_ch
Ah! got it. Will try..
Paolo Di Tommaso
@pditommaso
Jan 10 12:56
I guess
output:
file "04_mapping/${sample_name}.bam"
@matq007 frankly I don't know, make a test
Martin Proks
@matq007
Jan 10 13:00
@pditommaso it didn't work, looks like manifest is not available there; nvm, I've taken a different path :)
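For what it's worth, one possible workaround (untested sketch) would be to carry the version in a params entry and interpolate that in the config instead:

params.pipeline_version = '1.0'   // hypothetical params entry holding the release version

process {
    withName: 'ericscript' {
        container = "nfcore/rnafusion:ericscript_v${params.pipeline_version}"
    }
}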
Paolo Di Tommaso
@pditommaso
Jan 10 13:00
I see
Paolo Di Tommaso
@pditommaso
Jan 10 14:12
19.01.0-edge is out
Maxime Garcia
@MaxUlysse
Jan 10 14:19
Good news
Paolo Di Tommaso
@pditommaso
Jan 10 14:20
Hope so :D
Maxime Garcia
@MaxUlysse
Jan 10 14:22
I'm having fun with the modules
I'm hoping to see that soon too ;-)
Paolo Di Tommaso
@pditommaso
Jan 10 14:22
nice, it would be interesting to see how it works with Sarek or other nf-core pipelines
micans
@micans
Jan 10 14:24
:beers:
Maxime Garcia
@MaxUlysse
Jan 10 14:24
I'll keep you updated with my progress
But I think it has such great potential
Paolo Di Tommaso
@pditommaso
Jan 10 14:25
:v:
Anthony Underwood
@aunderwo
Jan 10 14:28
@pditommaso Does edge have modules or is that still on its own branch?
Paolo Di Tommaso
@pditommaso
Jan 10 14:28
nope
that stuff still needs some work
Anthony Underwood
@aunderwo
Jan 10 14:28
Is it earmarked for the next stable release?
Paolo Di Tommaso
@pditommaso
Jan 10 14:29
ear ?
Anthony Underwood
@aunderwo
Jan 10 14:29
Sorry, British idiom. Is it scheduled to be included in the stable release?
It works like a dream with my 'simple' workflow :)
Paolo Di Tommaso
@pditommaso
Jan 10 14:29
I was suspecting that :D
not sure, more like the July one; I need to test and collect different use cases
micans
@micans
Jan 10 14:30
I'm slow with the modules ... will use it once I need to re-use the same process. I guess our irods processes are prime candidates. But NF is already very expressive/declarative and separates config nicely. Things like publishDir, when, multiple channels, and especially the lack of re-use in my case mean I don't make the jump yet. It's good to know the awesomeness exists.
Paolo Di Tommaso
@pditommaso
Jan 10 14:37
also let's say welcome to @delagoya and @phupe as new project contributors !
micans
@micans
Jan 10 14:40
Welcome!
Tobias Neumann
@t-neumann
Jan 10 16:40

I'm trying to get an EC2 instance going with nextflow, but it keeps launching them without any associated key pairs, so I cannot connect. This is the command I'm using (yes, all files exist):

tobias.neumann@login-02 [BIO] ~/tmp/slurmupdate $ nextflow -C ~/dev/ObenaufLab/virus-detection-nf/config/general.config cloud create testmaster
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/tmp
Fetching EC2 prices (it can take a few seconds depending your internet connection) ..
> cluster name: testmaster
> instances count: 1
> Launch configuration:
 - driver: 'aws'
 - imageId: 'ami-0f99d00928be3a282'
 - instanceType: 't2.micro'
 - keyFile: ~/aws/awsbatch.pem
 - userName: 'ec2-user'

Please confirm you really want to launch the cluster with above configuration [y/n] y
Launching master node -- Waiting for `running` status.. ready.
Login in the master node using the following command:
  ssh -i <path to your private key file> ec2-user@ec2-18-184-21-105.eu-central-1.compute.amazonaws.com

So it says it set up everything correctly, but Amazon tells me there's no key pair associated and thus I cannot connect. Anybody got ideas?

tbugfinder
@tbugfinder
Jan 10 19:03
@t-neumann did you check .nextflow.log ?
Tobias Neumann
@t-neumann
Jan 10 19:08
@tbugfinder there's no .nextflow.log produced - I'm not running a pipeline, right? Or am I missing something?
Tain Mauricio Velasco Luquez
@TainVelasco-Luquez
Jan 10 19:59
Hello Nextflowers!!!
.
Stephen Kelly
@stevekm
Jan 10 20:05
@PoisonAlien the files in your Nextflow process are created in a temporary directory under 'work' during task execution. They only get moved to the 'publishDir' upon task completion. When you declare the output file, you give it the name of the file produced by your task; you do not give it any kind of path to the file unless a subdirectory is created during your task execution.
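For example, roughly (generic sketch, not your actual gemBS command):

output:
file "result.bam"            // a file the task creates directly in its work dir
file "subdir/result.bam"     // only valid if the task itself creates 'subdir' inside the work dir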
Tain Mauricio Velasco Luquez
@TainVelasco-Luquez
Jan 10 20:20

Hello nextflowers!!

I am trying to run STAR over several single-end files and I want to collect the output in a folder with the same name as the file. I also want to load the reference genome into memory before any alignment and, after all files have been aligned, unload it. So far this is my code:

params.reads
params.star_index
outDir = params.outdir
params.cpus

Channel
    .fromPath( params.reads )
    .ifEmpty { error "Cannot find any reads matching: ${params.reads}" }
    .map { file -> tuple(file.simpleName, file) }
    .toList()
    .set { dataset }

process STAR_alignment {

  tag "${datasetID}"

  publishDir mode: "move",  pattern: "${datasetID}/*Aligned*",  path: { "${outDir}/${datasetID}/" }

  input:
  set datasetID, file( datasetFile ) from dataset

  output:
  file( "${datasetID}/Aligned.toTranscriptome.out.bam" )
  file( "${datasetID}/Aligned.out.bam" )

  shell:

    """
    STAR --genomeLoad LoadAndExit --genomeDir ${params.star_index}

    for every ${datasetFile} in ${dataset}
        do
            mkdir ${datasetID}
    STAR \\
        --runThreadN ${params.cpus} \\
        --genomeDir ${params.star_index} \\
        --readFilesIn ${datasetFile} \\
        --readFilesCommand gunzip -c \\
        --outSAMtype BAM Unsorted \\
        --quantMode TranscriptomeSAM \\
        --outFileNamePrefix ${datasetID}/
    done

    STAR --genomeLoad Remove --genomeDir ${params.star_index}
    """

  }

I am having two problems:
1. It appears the STAR --genomeLoad LoadAndExit --genomeDir ${params.star_index} and the STAR --genomeLoad Remove --genomeDir ${params.star_index} are running every time the toList channel emits an item.
2. Apparently toList not only emits the file but also the string input.1, thus causing an error:

Command executed:

  STAR --genomeLoad LoadAndExit --genomeDir /home/tain/Documents/RNA_seq/My_genomes/GCA.GRCh38/GCA_000001405.15_GRCh38.STAR.index/

  for every input.1 SRR5442959_fastp.fastq.gz in DataflowVariable(value=[[SRR5442952_fastp, /home/tain/Documents/RNA_seq/santos/Raw_files/Results/cleaned_files/SRR5442$
52_fastp.fastq.gz], ...]])
      do
          mkdir [SRR5442952_fastp, /home/tain/Documents/RNA_seq/santos/Raw_files/Results/cleaned_files/SRR5442952_fastp.fastq.gz]

STAR \
      --runThreadN 23 \
      --genomeDir /home/tain/Documents/RNA_seq/My_genomes/GCA.GRCh38/GCA_000001405.15_GRCh38.STAR.index/ \
      --readFilesIn input.1 SRR5442959_fastp.fastq.gz \
      --readFilesCommand gunzip -c \
      --outSAMtype BAM Unsorted \
      --quantMode TranscriptomeSAM \
      --outFileNamePrefix [SRR5442952_fastp, /home/tain/Documents/RNA_seq/santos/Raw_files/Results/cleaned_files/SRR5442952_fastp.fastq.gz]/
  done

  STAR --genomeLoad Remove --genomeDir /home/tain/Documents/RNA_seq/My_genomes/GCA.GRCh38/GCA_000001405.15_GRCh38.STAR.index/

Command exit status:
  2

Any help will be deeply appreciated.

Best!

Stephen Kelly
@stevekm
Jan 10 20:22
@PoisonAlien I have an example demonstrating here: https://github.com/stevekm/nextflow-demos/tree/master/make-files
@TainVelasco-Luquez can you print out the contents of dataset? e.g. dataset.subscribe { "${it}" }
without running the rest of the script, that is
Stephen Kelly
@stevekm
Jan 10 20:28
  1. It appears the STAR --genomeLoad LoadAndExit --genomeDir ${params.star_index} and the STAR --genomeLoad Remove --genomeDir ${params.star_index} are running every time the toList channel emits an item.
This is to be expected, this is exactly how Nextflow processes work
it runs the process for every item emitted from the input channel
Stephen Kelly
@stevekm
Jan 10 20:33
@TainVelasco-Luquez there are a lot of things wrong with that Nextflow script, you might want to check out this example as well and some of the others in this repo: https://github.com/stevekm/nextflow-demos/blob/master/make-files/main.nf
Tain Mauricio Velasco Luquez
@TainVelasco-Luquez
Jan 10 20:38
@stevekm Thank you for your answer. I am going to look at the examples right away.
Stephen Kelly
@stevekm
Jan 10 20:38
dataset.subscribe { println "${it}" }
I am writing you a small example right now if you give me a minute
Tain Mauricio Velasco Luquez
@TainVelasco-Luquez
Jan 10 20:39
I have already checked the examples. Would it be possible to set up a different process to load the genome into memory and another one to remove it after the alignment has been done?
Stephen Kelly
@stevekm
Jan 10 20:47

starting files:

$ ll
total 168
drwxr-xr-x  22 kellys04  NYUMC\Domain Users   748B Jan 10 15:45 .
drwxr-xr-x  37 kellys04  NYUMC\Domain Users   1.2K Jan 10 15:35 ..
-rw-r--r--   1 kellys04  NYUMC\Domain Users   738B Jan 10 15:45 main.nf
-rwx--x--x   1 kellys04  NYUMC\Domain Users    14K Jan 10 15:37 nextflow
-rw-r--r--   1 kellys04  NYUMC\Domain Users     0B Jan 10 15:36 sample1.fastq
-rw-r--r--   1 kellys04  NYUMC\Domain Users     0B Jan 10 15:36 sample2.fastq
-rw-r--r--   1 kellys04  NYUMC\Domain Users     0B Jan 10 15:36 sample3.fastq

script:

params.genome_dir = "/some/dir"
params.cpus_to_use = 1
Channel.fromPath("*.fastq").into { input_ch1; input_ch2 }

input_ch1.subscribe { println "[input_ch1]: ${it}" }
input_ch2.map { file ->
    tuple(file.simpleName, file)
}
.into { input_ch2_1; input_ch2_2 }

input_ch2_1.subscribe {  println "[input_ch2_1]: ${it}"  }



process run {
    tag "${sampleID}"
    cpus params.cpus_to_use
    echo true

    input:
    set val(sampleID), file(fastq) from input_ch2_2

    output:
    file("${sampleID}.bam")

    script:
    output_bam = "${sampleID}.bam"
    """
    echo "load genome dir: ${params.genome_dir}"
    echo "running sample ${sampleID}, file ${fastq}, with ${params.cpus_to_use} CPUs"

    touch "${sampleID}.bam"
    """

}

output:

./nextflow run main.nf
N E X T F L O W  ~  version 18.10.1
Launching `main.nf` [sick_allen] - revision: 286103c692
[input_ch1]: /Users/kellys04/projects/nextflow-demos/test2/sample1.fastq
[input_ch1]: /Users/kellys04/projects/nextflow-demos/test2/sample2.fastq
[input_ch1]: /Users/kellys04/projects/nextflow-demos/test2/sample3.fastq
[input_ch2_1]: [sample1, /Users/kellys04/projects/nextflow-demos/test2/sample1.fastq]
[input_ch2_1]: [sample2, /Users/kellys04/projects/nextflow-demos/test2/sample2.fastq]
[input_ch2_1]: [sample3, /Users/kellys04/projects/nextflow-demos/test2/sample3.fastq]
[warm up] executor > local
[14/c63749] Submitted process > run (sample1)
[6e/41f2e9] Submitted process > run (sample3)
[86/7abd2a] Submitted process > run (sample2)
load genome dir: /some/dir
running sample sample1, file sample1.fastq, with 1 CPUs
load genome dir: /some/dir
running sample sample3, file sample3.fastq, with 1 CPUs
load genome dir: /some/dir
running sample sample2, file sample2.fastq, with 1 CPUs

I think this demonstrates the things you are trying to do

Would it be possible to set up a different process to load the genome into memory and another one to remove it after the alignment has been done?

Someone else would have to weigh in on that; I am not sure how such a thing would work. There is an option for using a ram-disk under /dev/shm, I believe, listed here: https://www.nextflow.io/docs/latest/process.html?highlight=set#scratch

However I am not sure if it's what you would want for this purpose

Usually if you have to load a file into memory like this, you are doing it in separate instances, since each process is isolated from the others
Stephen Kelly
@stevekm
Jan 10 20:53
if you have /dev/shm available, maybe there is a way to put the file there and then pass it into the process like any other file? Not sure if that would work, or if it would be compatible with STAR
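e.g. the directive on that docs page looks like this (untested sketch; I don't know whether STAR's shared-memory genome handling would actually benefit from it):

process STAR_alignment {
    scratch 'ram-disk'        // run the task in a RAM-backed scratch directory (i.e. /dev/shm)

    input:
    set val(datasetID), file(datasetFile) from dataset

    """
    echo "aligning ${datasetFile} for ${datasetID} from a ram-disk scratch dir"
    """
}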
Tain Mauricio Velasco Luquez
@TainVelasco-Luquez
Jan 10 21:18

Well @stevekm thank you very much for your insight. I have made a little run with your modifications:

 params.reads

 Channel
     .fromPath( params.reads )
     .map { file -> tuple(file.simpleName, file) }
     .set { dataset }

 process STAR_alignment {

   tag "${datasetID}"

   input:
   set val(datasetID), file( datasetFile ) from dataset

   output:
   file(  "${datasetID}.bam" )

   shell:

     """
     echo "simple name of file ${datasetID}"
  echo "file name ${datasetFile}"
     touch "${datasetID}.bam"
     """

   }

output:


N E X T F L O W  ~  version 18.10.1
Launching `try.nf` [furious_heyrovsky] - revision: 8f91a54979
[warm up] executor > local
[44/dfe489] Submitted process > STAR_alignment (SRR5442960_fastp)
[50/5ddd20] Submitted process > STAR_alignment (SRR5442972_fastp)
[60/d1b47e] Submitted process > STAR_alignment (SRR5442952_fastp)
[d8/e506fe] Submitted process > STAR_alignment (SRR5442949_fastp)
[a2/bbfc85] Submitted process > STAR_alignment (SRR5442958_fastp)
[6e/b97655] Submitted process > STAR_alignment (SRR5442951_fastp)
[ef/58cb3f] Submitted process > STAR_alignment (SRR5442950_fastp)
[d3/b92c34] Submitted process > STAR_alignment (SRR5442956_fastp)
[fe/a1da91] Submitted process > STAR_alignment (SRR5442955_fastp)
[ed/634df2] Submitted process > STAR_alignment (SRR5442971_fastp)
[cf/1ed8b2] Submitted process > STAR_alignment (SRR5442970_fastp)
[7c/edeb54] Submitted process > STAR_alignment (SRR5442969_fastp)
[ca/e2d7af] Submitted process > STAR_alignment (SRR5442959_fastp)
[70/378131] Submitted process > STAR_alignment (SRR5442953_fastp)
[dd/0a0f63] Submitted process > STAR_alignment (SRR5442954_fastp)
[55/acddec] Submitted process > STAR_alignment (SRR5442957_fastp)

 Pipeline completed at: Thu Jan 10 16:12:51 COT 2019

 Execution status: OK

I am going to implement it with STAR and see how it goes.

Regarding the memory thing, maybe it will be clearer with this:

This whole Nextflow pipeline is for implementing this for loop:

STAR --genomeLoad LoadAndExit --genomeDir index.150

for i in $(ls raw_data | sed s/_[12].fq.gz// | sort -u)
do
    STAR [...]
done

STAR --genomeLoad Remove --genomeDir index.150

Do you think it is feasible/efficient to implement it in Nextflow?

Stephen Kelly
@stevekm
Jan 10 22:26
I am not familiar with STAR, never used it, so I am not sure how the genome load part works. The big difference is that in your for loop all the files are being processed in the same instance, while in Nextflow they are being processed in independent instances
not sure how to describe it better in this situation
Rad Suchecki
@rsuchecki
Jan 10 23:30
Assuming you are running it all on one node/server @TainVelasco-Luquez -
  1. Have a process which will do the genome load and, importantly, in this process define an output (a log file, stdout or whatever) going into a singleton channel.
  2. Make sure that the STAR_alignment process takes the above as input - this will ensure that the genome is loaded into memory before the alignment is executed (rough sketch below).
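Something along these lines (rough, untested DSL1 sketch; the STAR flags are only illustrative - check the STAR docs for the correct --genomeLoad modes and whether your other options are compatible with the shared-memory genome):

process load_genome {
    executor 'local'    // shared memory only helps if everything runs on the same node

    output:
    file 'genome_loaded.txt' into genome_loaded_ch   // flag file, used only as a dependency

    """
    STAR --genomeLoad LoadAndExit --genomeDir ${params.star_index}
    touch genome_loaded.txt
    """
}

process STAR_alignment {
    tag "${datasetID}"

    input:
    file flag from genome_loaded_ch.collect()        // barrier: genome must be loaded first
    set val(datasetID), file(datasetFile) from dataset

    output:
    file "${datasetID}/Aligned.out.bam" into bam_ch

    """
    mkdir ${datasetID}
    STAR --runThreadN ${params.cpus} \\
         --genomeLoad LoadAndKeep \\
         --genomeDir ${params.star_index} \\
         --readFilesIn ${datasetFile} \\
         --readFilesCommand gunzip -c \\
         --outSAMtype BAM Unsorted \\
         --outFileNamePrefix ${datasetID}/
    """
}

process unload_genome {
    input:
    file bams from bam_ch.collect()                  // waits until every alignment has finished

    """
    STAR --genomeLoad Remove --genomeDir ${params.star_index}
    """
}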