These are chat archives for nextflow-io/nextflow

5th
May 2017
mitul-patel
@mitul-patel
May 05 2017 09:01
Thanks @pditommaso ... here is the code. When I try to use `when:` it doesn't run in the correct order:
    FirstINDEXComplete = 'true'
    INDEXComplete = 'false'
    ASSEMBLYComplete = 'false'
    MAPPINGComplete = 'false'

    def pipeline (j) {
        process index {
            publishDir "$outDir", mode:'copy', overwrite: true

            input:
            file fasta from genome

            when:
            FirstINDEXComplete == 'true'

            script:
            """
            INDEX.py --out ${outDir}
            INDEXComplete = 'true'
            FirstINDEXComplete = 'false'
            """
        }

        process mapping {
            publishDir "$outDir", mode:'copy', overwrite: true

            when:
            INDEXComplete == 'true'

            script:
            """
            MAPPING.py --out ${outDir}
            MAPPINGComplete = 'true'
            """
        }

        process assembly {
            publishDir "$outDir", mode:'copy', overwrite: true

            when:
            MAPPINGComplete == 'true'

            script:
            """
            assembly.py --out ${outDir}
            ASSEMBLYComplete = 'true'
            FirstINDEXComplete = 'true'
            """
        }
    }

    def itr = 1
    while (itr <= iterations) {
        status = pipeline(itr)
        itr++
    }
Phil Ewels
@ewels
May 05 2017 09:04
Please put long code snippets like this into pastebin or something if you can @mitul-patel - they're almost impossible to read on gitter :)
mitul-patel
@mitul-patel
May 05 2017 09:06
Thanks @ewels ... I hope it's better now.
Phil Ewels
@ewels
May 05 2017 09:06
:+1:
Paolo Di Tommaso
@pditommaso
May 05 2017 10:47
@mitul-patel that code will never work: you cannot put a process inside a method, and NF does not allow a while loop to repeat the execution of a process.
mitul-patel
@mitul-patel
May 05 2017 10:51
Thanks @pditommaso . I got the point here......
Paolo Di Tommaso
@pditommaso
May 05 2017 10:51
you need to organise your data so that it will trigger the execution of the same process multiple times
mitul-patel
@mitul-patel
May 05 2017 10:53
and I guess the way to do it is connecting the input and output channels of processes. Am I right?
Paolo Di Tommaso
@pditommaso
May 05 2017 10:54
yes
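for example, something along these lines, where the output channel of one process feeds the next (just a rough sketch; the relative output names are made up):
    process index {
        input:
        file fasta from genome

        output:
        // whatever INDEX.py writes into this folder, created inside the task work dir
        file 'index_dir' into index_ch

        """
        INDEX.py --out index_dir
        """
    }

    process mapping {
        input:
        // runs as soon as the index process has produced its output
        file idx from index_ch

        output:
        file 'mapping_dir' into mapping_ch

        """
        MAPPING.py --out mapping_dir
        """
    }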
mitul-patel
@mitul-patel
May 05 2017 10:55
I am fine with the input channels, but I have a problem with the output channels. When I specify an output channel I receive a Missing file error.
Do I have to specify the full path where the output file will be written?
Or just the file name or folder name?
    process index {
    publishDir '$outDir/reference/iteration1', mode:'copy', overwrite: true

    input:
    file fasta from genome
    output:
    file "${genome_base}" into INDEX

    script:
    """
    Refindex.py --itr 1 --ref ${fasta} --out ${outDir}
    """
      }
I tried this one and got the Missing file error.
Evan Floden
@evanfloden
May 05 2017 10:59
file ${outDir} into index
mitul-patel
@mitul-patel
May 05 2017 11:04
Thanks @skptic. I tried it and got this error
            File `/home/mitul.patel/test/run` is out of the scope of process working dir:
Evan Floden
@evanfloden
May 05 2017 11:07
Hard to see from only this script, but if I understand your pipeline correctly (index, mapping, assembly), you would do best by studying one of our example pipelines which does the same.
There is no need to try and reinvent it all. I have personally found it best to look at other people's pipelines, ensuring I can run them, and then making sure I understand how each process works.
mitul-patel
@mitul-patel
May 05 2017 11:11
I have looked at it before but could not find a solution. In my pipeline my Python script creates directories and writes the output files there.
The script creates two more folders within $outDir and the output files go there.
Evan Floden
@evanfloden
May 05 2017 11:12
Are you still using the complete bool variables??
e.g. `FirstINDEXComplete = 'true'`
mitul-patel
@mitul-patel
May 05 2017 11:13
no....
I used them when I omitted the output channels.
Evan Floden
@evanfloden
May 05 2017 11:14
I would suggest you upload your updated script to pastebin and I will have a look now
mitul-patel
@mitul-patel
May 05 2017 11:14
ok...thanks.
Evan Floden
@evanfloden
May 05 2017 11:17
Also I think you are missing a key concept that is often misunderstood. NF handles the flow of the pipeline execution. This means it “knows” what processes have been completed, and it “knows” which inputs are required and which processes have to be completed before the next process can begin.
I’m sure we can get your pipeline running though :thumbsup:
:wink:
mitul-patel
@mitul-patel
May 05 2017 11:20
here is the code......https://pastebin.com/r4N6UC44
Evan Floden
@evanfloden
May 05 2017 11:21
Great!
Can you also upload Refindex.py to be sure there are no problems there too?
mitul-patel
@mitul-patel
May 05 2017 11:23
ok...
here is the Refindex.py https://pastebin.com/nTPqj3Lj
Evan Floden
@evanfloden
May 05 2017 11:29
Maybe I have a naive question, but why do you have to iterate over the pipeline 10 times?
I think your problem is that your Python script is trying to output to absolute paths.
You need to have the output of your Python script be in the working directory. NF executes each process in its own working directory, which I think you can reference with something like $PWD in Python.
The out-of-scope error could be caused by trying to write files outside the working directory.
    reference_path=os.path.abspath(args.reference)
    reference_index=os.path.abspath(os.path.join(args.outDir,"reference"))
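In the NF process you would then do something like this (rough sketch; the relative output name is just an example), so everything stays inside the task work dir and NF can collect it:
    process index {
        publishDir "$outDir/reference/iteration1", mode: 'copy', overwrite: true

        input:
        file fasta from genome

        output:
        // relative path: created inside the task work dir, then published by publishDir
        file 'reference' into INDEX

        script:
        """
        Refindex.py --itr 1 --ref ${fasta} --out reference
        """
    }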
mitul-patel
@mitul-patel
May 05 2017 11:35
The pipeline has 3 steps: index, mapping, assembly. The first iteration will create an index which will be used in mapping, and the output of mapping will be used in assembly. In the second iteration the index will use the assembly output and pass the index to mapping, and so on.
Evan Floden
@evanfloden
May 05 2017 11:35
???????
That is one iteration
mitul-patel
@mitul-patel
May 05 2017 11:36
yes
Evan Floden
@evanfloden
May 05 2017 11:36
So you only need one iteration.
mitul-patel
@mitul-patel
May 05 2017 11:37
it would be nice to run more iterations over the same order: index > mapping > assembly
Evan Floden
@evanfloden
May 05 2017 11:38
That is fine. But those are not iterations…. Those are different runs.
mitul-patel
@mitul-patel
May 05 2017 11:38
If I create a directory within the process block and pass the path to Python, would that work?
Evan Floden
@evanfloden
May 05 2017 11:38
That would be better
mitul-patel
@mitul-patel
May 05 2017 11:39
OK, I will change that. I have also tried to use each for iterations.
If I use each in index, how can I pass the iteration number to mapping and assembly?
Evan Floden
@evanfloden
May 05 2017 11:41
It is automatic. If you have 20 fastq files, but they all use the same index for mapping, NF will automatically run the index process only once, and the mapping 20 times.
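Something like this (rough sketch; assumes index_ch is the output channel of your index process, and the MAPPING.py flags are made up):
    read_pairs = Channel.fromFilePairs('data/*_R{1,2}.fastq.gz')

    process mapping {
        input:
        // .first() turns the single index into a value channel,
        // so the same index is reused by every mapping task
        file idx from index_ch.first()
        set pair_id, file(reads) from read_pairs

        output:
        file "${pair_id}.bam" into mapped_ch

        """
        MAPPING.py --index ${idx} --reads ${reads} --out ${pair_id}.bam
        """
    }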
What is changing between the 1st and 2nd iteration? I am very confused as to what you mean.
mitul-patel
@mitul-patel
May 05 2017 11:42
    process index {
        PublishDir "$outDir", mode:'copy', overwrite: true

        input:
        file fasta from genome
        each itr from 1..iterations

        output:
        file "$outDir" into indexing

        script:
        """
        Refindex.py --itr ${itr} --ref ${fasta} --out ${outDir}
        """
    }

    process mapping {
        PublishDir "$outDir", mode:'copy', overwrite: true

        input:
        file index from indexing

        output:
        file "$outDir" into mapping

        script:
        """
        mapping.py --itr ${itr} --ref ${fasta} --out ${outDir}
        """
    }
Evan Floden
@evanfloden
May 05 2017 11:43
But why???
What do you expect to be different between index with itr1 and index with itr2
mitul-patel
@mitul-patel
May 05 2017 11:44
They will be different, as each iteration will create a different assembly based on the previous mapping.
Evan Floden
@evanfloden
May 05 2017 11:44
Okay, I get you, it is circular?
mitul-patel
@mitul-patel
May 05 2017 11:44
yes
Evan Floden
@evanfloden
May 05 2017 11:48
Okay, as I understand it (but let's ask @pditommaso to be sure), NF creates a directed acyclic graph, which by definition cannot have circular processes. You could do it very simply with a bash script that calls NF though.
Just have the variable $itr as a NF command-line parameter, and have publishDir output to results/$itr.
mitul-patel
@mitul-patel
May 05 2017 11:51
You mean writing each process into a different NF file (3 NF files) and calling them from bash?
Evan Floden
@evanfloden
May 05 2017 11:51
No no no.
Set it up like so:
nextflow run yourpipeline --itr 1
then when complete
nextflow run yourpipeline --itr 2 --input results/itr1
then
nextflow run yourpipeline --itr 3 --input results/itr2 etc.
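On the pipeline side it would look roughly like this (sketch; the file pattern is just an example):
    params.itr   = 1
    params.input = 'data'    // results folder of the previous iteration

    prev_assembly = Channel.fromPath("${params.input}/*.fa")

    process index {
        // every run publishes into its own iteration folder
        publishDir "results/itr${params.itr}", mode: 'copy'

        input:
        file fasta from prev_assembly

        output:
        file 'reference' into INDEX

        """
        Refindex.py --itr ${params.itr} --ref ${fasta} --out reference
        """
    }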
Shellfishgene
@Shellfishgene
May 05 2017 12:47
-bash: module: line 1: syntax error: unexpected end of file -bash: error importing function definition for `BASH_FUNC_module'
Anyone seen this before?
Paolo Di Tommaso
@pditommaso
May 05 2017 12:48
what is reporting this ?
Shellfishgene
@Shellfishgene
May 05 2017 12:48
I'm trying a simple blast example, this is in the .command.log file. Does this actually have to do with "module", the program that loads env vars?
Paolo Di Tommaso
@pditommaso
May 05 2017 12:49
I don't think so, unless you are using it
Shellfishgene
@Shellfishgene
May 05 2017 12:50
I am, maybe I'll try without
Shellfishgene
@Shellfishgene
May 05 2017 13:00
With the full blastn path it works. So the module directive described in the Nextflow docs, is that independent of the Linux one? Or will it just call the installed system module command?
Paolo Di Tommaso
@pditommaso
May 05 2017 13:01
it just calls the installed system module command
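e.g. (the module name is just an example):
    process blast {
        // resolved on the compute node with the system `module load` command
        module 'blast/2.2.31'

        """
        blastn -version
        """
    }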
Shellfishgene
@Shellfishgene
May 05 2017 13:05
Now it works, maybe I had a typo somewhere
Paolo Di Tommaso
@pditommaso
May 05 2017 13:06
:v:
Shellfishgene
@Shellfishgene
May 05 2017 13:06
I think my NQSII executor works now. Are there tests I can run live on the cluster?
Paolo Di Tommaso
@pditommaso
May 05 2017 13:06
wow, cool
you can try to run this pipeline https://github.com/nextflow-io/rnatoy
provided you have docker installed (I guess no) or the bowtie, tophat and cufflinks tools
Shellfishgene
@Shellfishgene
May 05 2017 13:08
No docker yet. Is that hard to install without root?
Paolo Di Tommaso
@pditommaso
May 05 2017 13:09
nope
maybe you have some chance of getting Singularity installed instead
Shellfishgene
@Shellfishgene
May 05 2017 13:13
Hmm, yes
But I was more asking if there are software tests in NF that systematically test all functions. For example, I'm not sure how to test that the deletion of jobs works with my executor.
Paolo Di Tommaso
@pditommaso
May 05 2017 13:15
unfortunately no, the integration tests only run on our local cluster
Phil Ewels
@ewels
May 05 2017 13:37
@pditommaso - I'm putting together a summary e-mail that is sent when a workflow finishes. Is there a variable somewhere that collects Nextflow log messages? I was thinking that it would be nice to have the [4f/282516] Cached process > fastqc (SRR4238355_subsamp) statements in the e-mail
(though not important, so only if it's easy)
Shellfishgene
@Shellfishgene
May 05 2017 13:59
Do I need to do anything more than add singularity to the path? I get env: singularity: No such file or directory
Paolo Di Tommaso
@pditommaso
May 05 2017 14:19
well, it needs to be installed properly
you don't need anything special to use it, but it still requires root to install it
@ewels no, unfortunately that's not possible
Shellfishgene
@Shellfishgene
May 05 2017 14:22
@pditommaso Yes, I didn't see that it also needs root, I had just changed the prefix... Oh well.
Different question: Should my $PATH be available from nextflow? Or do I have to set that somehow?
Paolo Di Tommaso
@pditommaso
May 05 2017 14:25
not understanding, what do you mean ?
Shellfishgene
@Shellfishgene
May 05 2017 14:26
Will the $PATH variable be the same for the nf jobs that get run via the scheduler as it is for my user when I start nextflow?
Paolo Di Tommaso
@pditommaso
May 05 2017 14:29
the $PATH is the same as in the computing node (not the launcher node)
plus NF prefixes it with the project $baseDir/bin folder (if it exists)
and any other path you may specify in the nextflow config file
Shellfishgene
@Shellfishgene
May 05 2017 14:32
so just PATH = "/foo/bar:$PATH" in the config file?
Paolo Di Tommaso
@pditommaso
May 05 2017 14:32
env.PATH = "/foo/bar:$PATH"
or
env {
   PATH = "/foo/bar:$PATH"
}
Shellfishgene
@Shellfishgene
May 05 2017 15:00
Thanks. RNASeq toy example finally ran through, everything seems to work fine.
Paolo Di Tommaso
@pditommaso
May 05 2017 15:00
cool
are you planning to open a pull request for your executor ?
Shellfishgene
@Shellfishgene
May 05 2017 15:01
would you pull without me writing tests similar to the ones for SGE et al.?
Paolo Di Tommaso
@pditommaso
May 05 2017 15:01
well with tests :)
Shellfishgene
@Shellfishgene
May 05 2017 15:01
hehe
I'll see
Shellfishgene
@Shellfishgene
May 05 2017 15:22
blast_result
   .collectFile(name: params.out)
   .println { file -> "matching sequences:\n ${file.text}" }
I don't quite get what file -> does here
Paolo Di Tommaso
@pditommaso
May 05 2017 15:23
the argument of println is the closure { file -> "matching sequences:\n ${file.text}" }
a closure is an anonymous function that will get invoked for each entry emitted by the channel
each time the closure is invoked the i-th element is passed to the closure as an argument
that argument is the file -> declaration
thus file is the i-th file emitted by the channel that is printed out
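in plain Groovy terms:
    // a closure is just an anonymous function; here one is assigned to a variable
    def greet = { name -> "hello ${name}" }
    println greet('world')      // prints: hello world

    // println { ... } on a channel invokes the closure once per emitted item
    Channel.from('a.txt', 'b.txt')
           .println { file -> "got ${file}" }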
Shellfishgene
@Shellfishgene
May 05 2017 15:26
Ok...I think ;). How do I just print that to a single output file?
Paolo Di Tommaso
@pditommaso
May 05 2017 15:26
you already have a single file, it is file
remove .text and you will see it
Shellfishgene
@Shellfishgene
May 05 2017 15:28
Duh, ok, thanks.
Paolo Di Tommaso
@pditommaso
May 05 2017 15:28
:+1:
Shellfishgene
@Shellfishgene
May 05 2017 15:28
One more and then I'll stop ;): What's the best way to have that file moved out of the work directories at the end?
Paolo Di Tommaso
@pditommaso
May 05 2017 15:29
for files created by a process, use publishDir
for any other file you can copy/move it with the plain Java/Groovy API or the helper methods shown here
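for your collectFile result that could look like this (sketch; the results folder is just an example):
    resultDir = file('results')
    resultDir.mkdirs()

    blast_result
        .collectFile(name: params.out)
        .subscribe { it.copyTo(resultDir) }    // copyTo is one of those file helper methods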
Shellfishgene
@Shellfishgene
May 05 2017 15:31
I thought I read most of the docs, but apparently not. Muchas gracias! Have a nice weekend...
Paolo Di Tommaso
@pditommaso
May 05 2017 15:32
Buon fin de semana ! :smile:
Manuel
@kohleman
May 05 2017 15:37
Hi, can I control when a Channel is evaluated? I have a process which transfers fastq.gz files and then uses a glob pattern to find the files. But on the first run it always fails, as the files are not yet there, so the Channel is empty. On the second run the data (fastq.gz files) are there and the Channel works.
Paolo Di Tommaso
@pditommaso
May 05 2017 15:38
can you show a snippet of your code ?
Manuel
@kohleman
May 05 2017 15:40
process transfer_folder_to_cluster {
    cpus 1
    executor = 'local'
    errorStrategy { task.exitStatus == 255 ? 'retry' : 'terminate' }
    maxRetries 9

    tag "Transfer ${params.baselBaseDir}${params.runFolder}/${flowlane} to ${params.eulerBaseDir}${params.runFolder}/"

    input:
    val(flowlane) from flowlane_transfer

    output:
    file ("${params.runFolder}_${flowlane}") into marker_file

    """
    #echo ${params.baselBaseDir}${params.runFolder}/${flowlane} ${params.eulerBaseDir}${params.runFolder} >> /cluster/home/kohleman/netapp/nextflow/debug.txt
    ssh ${params.destinationCluster}
    lftp -c 'set sftp:connect-program "ssh -a -x -i ${params.idKey}"; connect ${params.connectStringBasel};
      mirror -P ${params.parallelTransfers} ${params.baselBaseDir}${params.runFolder}/${flowlane} ${params.eulerBaseDir}${params.runFolder}/; exit 0;'
    touch ${params.runFolder}_${flowlane}
    """
}
def read_pairs_regex = "${params.eulerBaseDir}${params.runFolder}/*{${params.laneList}}/**R{1,2}_001.fastq.gz"
log.info("Using regex: " + read_pairs_regex)

Channel.fromFilePairs(read_pairs_regex)
       .set{read_pairs}
    //   .println()


process trimmomatic {
    cpus 2
    // executor = 'lsf'
    executor = 'local'
    // memory '6 GB' // per core
    maxRetries 5

    // tag "${pair_id}"
    tag "${params.runFolder}_${extractFolderName(pair_id)[1]}"

    publishDir "${params.eulerResults}/${extractRunFolder(read_pairs_regex)[0]}/${extractFolderName(pair_id)[1]}/${extractFolderName(pair_id)[0]}/${extractFolderName(pair_id)[0]}", mode: 'link'

    input:
    // val(flowlane_number) from flowlane_process
    set pair_id, file(reads) from read_pairs
    file ("${params.runFolder}_${extractFolderName(pair_id)[1]}") from marker_file.toList()

    output:
    file ("${pair_id}*_paired.fastq.gz") into paired_file
    file ("${pair_id}*_unpaired.fastq.gz") into unpaired_file
    file "${pair_id}_out.txt" into stdout
    file "${pair_id}_trim_log.txt" into trim_log

    """
    trimmomatic PE -threads 12 -phred33 -trimlog ${pair_id}_trim_log.txt \
    ${reads} ${pair_id}_R1_paired.fastq.gz ${pair_id}_R1_unpaired.fastq.gz \
    ${pair_id}_R2_paired.fastq.gz ${pair_id}_R2_unpaired.fastq.gz \
    ILLUMINACLIP:/cluster/apps/gdc/trimmomatic/0.35/adapters/TruSeq3-PE-2.fa:2:30:10 \
    LEADING:3 TRAILING:3 SLIDINGWINDOW:4:10 MINLEN:10 > ${pair_id}_out.txt 2>&1
    """
}
Paolo Di Tommaso
@pditommaso
May 05 2017 15:41
first problem learn gitter :)
you need to begin with a triple ` then a newline
good !
the problem is with trimmomatic, right ?
Manuel
@kohleman
May 05 2017 15:45
yes
the problem is that the Channel "read_pairs" is empty
Paolo Di Tommaso
@pditommaso
May 05 2017 15:46
try to check if read_pairs produces the expected content with
Manuel
@kohleman
May 05 2017 15:46
yes, it works on the second run
Paolo Di Tommaso
@pditommaso
May 05 2017 15:46
Channel.fromFilePairs(read_pairs_regex)
       .view()
       .set{read_pairs}
um weird . .
need to go now, sorry
Manuel
@kohleman
May 05 2017 15:46
when the data is there from the process "transfer_folder_to_cluster"
Ok
Phil Ewels
@ewels
May 05 2017 16:09
@pditommaso - instead of log messages, is there any other way to access a list of processes that have run and their tags / status etc?
Shellfishgene
@Shellfishgene
May 05 2017 16:15
@ewels I'm just looking at the doc page for the execution report, do you mean that?
Phil Ewels
@ewels
May 05 2017 16:16
Essentially, yes - but accessible from within the pipeline so that I can include it in an email sent by the completion handler
Shellfishgene
@Shellfishgene
May 05 2017 16:19
ah, ok
Anthony Underwood
@aunderwo
May 05 2017 16:33
This is a really dumb question but I can't seem to figure it out having read docs, examples etc
I'm running a script that outputs a bunch of files, and I want to put just two of these into the output channel
So I do
     set pair_id, file("output_forward_paired.fq.gz"), file("output_reverse_paired.fq.gz") into trimmed_reads_for_QC, trimmed_reads_for_copying
However, in a process receiving this channel I do the following
set pair_id, file(reads) from trimmed_reads_for_QC
and when echoing with echo "QCing ${reads}", the stdout only shows the first file
QCing output_forward_paired.fq.gz
Anthony Underwood
@aunderwo
May 05 2017 16:56
Doh, solved it. I needed to specify that I was expecting 2 files as the input:
set pair_id, file(for_reads), file(rev_reads) from trimmed_reads_for_QC