These are chat archives for nextflow-io/nextflow

8th
Jun 2018
Luca Cozzuto
@lucacozzuto
Jun 08 2018 07:56
have you tried with a System.exit(0) if the condition is met (file is present)?
Will Rowe
@will-rowe
Jun 08 2018 08:22
Hi. I was wondering what is the best way to have a process that only takes a subset of read files from another process? e.g. one process QCs reads and only sends reads on to the next process if they are above a certain threshold?
Luca Cozzuto
@lucacozzuto
Jun 08 2018 08:26
If you save the good reads to a file and they are not sorted you can just use head
process QC {
    input:
    file reads from reads_ch    // reads_ch: your input channel of read files

    output:
    file "filtered_subset.fastq" into subset
    file "filtered_total.fastq" into filtered_all

    """
    run-myQC ${reads} > filtered_total.fastq
    head -n 40000 filtered_total.fastq > filtered_subset.fastq
    """
}
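One detail worth keeping in mind with the head trick above: FASTQ stores each read as four lines, so head -n 40000 keeps the first 10,000 reads, not 40,000. A toy illustration (file names invented):

```shell
# build a toy "filtered" FASTQ with 3 records (4 lines each)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nGGGG\n+\nIIII\n' > filtered_total.fastq

# keep the first 2 records = 8 lines
head -n 8 filtered_total.fastq > filtered_subset.fastq

wc -l < filtered_subset.fastq   # prints 8
```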
Will Rowe
@will-rowe
Jun 08 2018 08:31
Thanks. So in your example, if a sample results in a filtered_total.fastq that is below my threshold and I don't want to send it to subset, won't it throw an error if subset doesn't receive the expected output?
Luca Cozzuto
@lucacozzuto
Jun 08 2018 08:33
in my example the script run-myQC extracts the reads that pass the threshold
if no read passes the filter the final file (both total and subset) will be empty
Will Rowe
@will-rowe
Jun 08 2018 08:38
ah, okay. So a channel would get the empty file and the next process could evaluate whether it was empty or not?
Luca Cozzuto
@lucacozzuto
Jun 08 2018 08:41
yes you can use a when condition (file empty or not) to trigger the next process.
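A minimal sketch of that when: idea, reusing the subset channel from the example above (the process name and the run-next-step command are invented; Nextflow adds an isEmpty() helper to file objects):

```nextflow
process nextStep {
    input:
    file fq from subset

    // run this task only when the upstream file actually contains reads
    when:
    !fq.isEmpty()

    script:
    """
    run-next-step ${fq}
    """
}
```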
Will Rowe
@will-rowe
Jun 08 2018 08:42
perfect - thank you so much for your help!
micans
@micans
Jun 08 2018 12:36
As I understand it, another option is to have no output files and optional true in the output directive. Perhaps not useful for your case.
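For the record, that directive looks like this (a hedged fragment, channel and file names invented); if the script never creates the file, the task still succeeds and simply emits nothing into the channel:

```nextflow
output:
// no error is raised if this file is missing when the task ends
file 'filtered_subset.fastq' optional true into subset
```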
Paolo Di Tommaso
@pditommaso
Jun 08 2018 13:39
@ShawnConecone Feel free to comment nextflow-io/nextflow#735
Dave Istanto
@DaveIstanto
Jun 08 2018 14:20
@lucacozzuto I will definitely try that, thank you
@pditommaso Definitely, it would be wonderful, thank you
Pierre Lindenbaum
@lindenb
Jun 08 2018 14:30

Hi all, I've got two questions today :-) :

1) say, my workflow searches for some fastqs in a directory (process scan_input_dir {), maps them successfully with bwa and then calls the VCF. One month later a collaborator gives me a new set of fastqs. Is there a mechanism to tell NF not to try to re-map the previous bams?
2) If a process maps some reads -> mapped.bam and another process sorts the bam -> sorted.bam, and I know mapped.bam will not be used anymore, can I delete the file mapped.bam (following a possible symlink) and then touch mapped.bam (to create an empty file)?

thanks !

Alexander Peltzer
@apeltzer
Jun 08 2018 14:32
For 2.): Can't you just combine these two (I assume you're using samtools sort or sambamba sort for this?) into a single step, without keeping mapped.bam anyway?
Pierre Lindenbaum
@lindenb
Jun 08 2018 14:34
@apeltzer thanks. I was thinking about a more general case: can I delete+touch a file from a previous step without hurting my pipeline if I need to resume the workflow?
Luca Cozzuto
@lucacozzuto
Jun 08 2018 14:38
I don't remember exactly, but with cache false the content of that process isn't "saved" for further use
so the pipeline goes on, but if you want to re-run, it will re-run that step...
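That directive is just cache false inside the process body; a sketch with invented process and channel names:

```nextflow
process sortBam {
    // never cache this task: it re-runs on every execution, even with -resume
    cache false

    input:
    file bam from mapped_bam

    output:
    file 'sorted.bam' into sorted_bam

    """
    samtools sort ${bam} > sorted.bam
    """
}
```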
Paolo Di Tommaso
@pditommaso
Jun 08 2018 14:40
1) say, my workflow searches for some fastqs in a directory (process scan_input_dir {), maps them successfully with bwa and then calls the VCF. One month later a collaborator gives me a new set of fastqs. Is there a mechanism to tell NF not to try to re-map the previous bams?
yes, provided you haven't deleted the old task outputs
to be more precise, the files you declared in the output: section
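Concretely, this means re-running with the -resume flag (the script name here is invented): Nextflow reuses the cached results for tasks whose inputs haven't changed, so only the new fastqs get mapped:

```
nextflow run main.nf -resume
```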
Vladimir Kiselev
@wikiselev
Jun 08 2018 14:42
@pditommaso how to install a specific version of NF? e.g. 0.29.1
Paolo Di Tommaso
@pditommaso
Jun 08 2018 14:42
export NXF_VER=0.29.1
done
Vladimir Kiselev
@wikiselev
Jun 08 2018 14:43
Cool, many thanks!
Pierre Lindenbaum
@lindenb
Jun 08 2018 14:43
@pditommaso thanks !
Paolo Di Tommaso
@pditommaso
Jun 08 2018 14:43
:v:
:smile:
Ryo Kita
@rkita
Jun 08 2018 16:15

Hello again, I'm having trouble structuring my nextflow pipeline.
I have 40 tissues, that I ultimately want to create 40 files for.
For each tissue, I split it into 100 chunks and run a separate process for each chunk. Like so:

tissues = Channel.from('A'..'Z')
chunks = Channel.from(1..100)

tissues
    .combine(chunks)
    .set{tissuechunks}

process runAnalysis {

    input:
    set tissue, chunk from  tissuechunks

    output:
    file "output.${tissue}.${chunk}" into tissuechunkResults

    """
    echo 1 > output.${tissue}.${chunk}
    """
}

The problem is that I ultimately want to concatenate all of the chunks together for each tissue, so that I end up with a separate file for each tissue. The order of the chunks doesn't matter to me.

The code below is what I have so far - but there are two problems with it. (1) It waits until all the tissue-chunk pairs are complete (because of the collect statement). and (2) It doesn't separate out the results into different tissues.

process concatenateChunks {

    input:
    file tissuechunkResult from tissuechunkResults.collect()

    output:
    file "tissueresults.txt" into finaloutput

    """
    cat $tissuechunkResult >  tissueresults.txt
    """
}

Am I structuring my code improperly for this problem?

Ryo Kita
@rkita
Jun 08 2018 16:24
I'm thinking that I need a separate process for each tissue, but I'm having trouble doing that with the output from the runAnalysis process
Paolo Di Tommaso
@pditommaso
Jun 08 2018 16:52
there's a nice tip in the documentation in this regard
With Nextflow, in most cases, you don't need to take care of naming output files, because each task is executed in its own unique temporary directory, so files produced by different tasks can never override each other. Also meta-data can be associated with outputs by using the set output qualifier, instead of including them in the output file name.
in other words, output the tissue id along the file chunk, then group them together with groupTuple
to summarise
process runAnalysis {

    input:
    set tissue, chunk from  tissuechunks

    output:
    set tissue, file("output.${tissue}.${chunk}") into tissuechunkResults

    """
    echo 1 > output.${tissue}.${chunk}
    """
}

process concatenateChunks {

    input:
    set tissue, file(tissuechunkResult) from tissuechunkResults.groupTuple()

    output:
    file "tissueresults.txt" into finaloutput

    """
    cat $tissuechunkResult >  tissueresults.txt
    """
}
micans
@micans
Jun 08 2018 17:00
I have an input which is a set of files from chname.collect(). I'd like to take a different action if it is just one file. I can do this in the script section, but is it also possible in the groovy code?
Paolo Di Tommaso
@pditommaso
Jun 08 2018 17:02
input: file names from chname.collect()
script:
def c = names.size()
if( c==1 ) 
"""
command_for_single_file
"""
else
"""
command_for_many_files
"""
micans
@micans
Jun 08 2018 17:03
When you do it it looks so simple ... thanks Paolo :thumbsup: have a great weekend
Paolo Di Tommaso
@pditommaso
Jun 08 2018 17:03
same there
Ryo Kita
@rkita
Jun 08 2018 17:15
Thank you Paolo! Btw, just have to say that you and this gitter is such a huge plus for using nextflow. I'm super happy with my experience so far.
Paolo Di Tommaso
@pditommaso
Jun 08 2018 17:16
not for my sanity ! :joy:
Ryo Kita
@rkita
Jun 08 2018 17:22
haha, i can imagine
Dave Istanto
@DaveIstanto
Jun 08 2018 18:17

Hey guys I have a question about clusterOptions, I usually use PBS for executor, and there's a native configuration that sends logs, in the qsub file, it looks like this:
"#PBS -o <some directory/logfile.out>"
I tried using this in the config file:
process {
executor='pbs'
queue='big_mem'
time='00:10:00'
memory='1 GB'
cpus=1
clusterOptions = '-o <some directory>/logfile.out'
}
but it does not seem to be working, has anyone encountered this problem and its potential solution?

thank you again

Paolo Di Tommaso
@pditommaso
Jun 08 2018 18:18
please be a good citizen and format properly code snippets :wink:
triple `
new-line
code
new-line
triple `
Dave Istanto
@DaveIstanto
Jun 08 2018 18:18
oh sorry

Hey guys I have a question about clusterOptions, I usually use PBS for executor, and there's a native configuration that sends logs, in the qsub file, it looks like this:
"#PBS -o <some directory/logfile.out>"
I tried using this in the config file:
```
process {
executor='pbs'
queue='big_mem'
time='00:10:00'
memory='1 GB'
cpus=1
clusterOptions = '-o <some directory>/logfile.out'
}
but it does not seem to be working, has anyone encountered this problem and its potential solution?

thank you again

apologies for spamming, I will do this outside
Paolo Di Tommaso
@pditommaso
Jun 08 2018 18:19
ahah, triple ` before and after the code
Dave Istanto
@DaveIstanto
Jun 08 2018 18:19
Hey guys I have a question about clusterOptions, I usually use PBS for executor, and there's a native configuration that sends logs, in the qsub file, it looks like this:
"#PBS -o <some directory/logfile.out>"
I tried using this in the config file:

process {
executor='pbs'
queue='big_mem'
time='00:10:00'
memory='1 GB'
cpus=1
clusterOptions = '-o <some directory>/logfile.out'
}
but it does not seem to be working, has anyone encountered this problem and its potential solution?

thank you again

apologies for spamming, I will do this outside
Paolo Di Tommaso
@pditommaso
Jun 08 2018 18:19
:+1:
Dave Istanto
@DaveIstanto
Jun 08 2018 18:19
sorry for that
Paolo Di Tommaso
@pditommaso
Jun 08 2018 18:20
no pb
the -o option is used by NF, you cannot override it
Dave Istanto
@DaveIstanto
Jun 08 2018 18:21
oh I see, so is there no way of getting around this problem?
Paolo Di Tommaso
@pditommaso
Jun 08 2018 18:22
what's your specific need?
Dave Istanto
@DaveIstanto
Jun 08 2018 18:23
I want to get the logs of the cluster I am using (not the nextflow log), and have them written to a file specific to the directory
Paolo Di Tommaso
@pditommaso
Jun 08 2018 18:25
you can explicitly specify an output .command.log and use publishDir to copy it
process foo {
    publishDir '/some/path'
    output:
    file '.command.log'
    script:
    .. etc
}
Dave Istanto
@DaveIstanto
Jun 08 2018 18:35
hmm, I think the problem I have is very specific to my workplace, so I'll try to find a workaround there. But thanks @pditommaso!! Like @rkita said, this gitter, the group, and yourself are very helpful for newcomers, since there are not many nextflow questions out there
Paolo Di Tommaso
@pditommaso
Jun 08 2018 18:36
:+1:
Félix C. Morency
@fmorency
Jun 08 2018 18:44
When a file gets WARN: Failed to publish file: ..., will it get retried at some point?
Paolo Di Tommaso
@pditommaso
Jun 08 2018 18:47
nope
I think there's a feature request for that
but likely it's a permanent error
Félix C. Morency
@fmorency
Jun 08 2018 18:50
ok. looks like a cluster / shared fs glitch on our end. the pipeline continues to process normally.