These are chat archives for nextflow-io/nextflow

10th
Dec 2018
Paolo Di Tommaso
@pditommaso
Dec 10 2018 06:56
@markberger not sure I understand whether you are referring to the workflow work dir, each task's scratch path, or something else
micans
@micans
Dec 10 2018 10:23
@PhilPalmer what exactly is happening? Can I just check one thing: do you need merge rather than mix?
Alexander Peltzer
@apeltzer
Dec 10 2018 11:33
I have a pipeline that consumes a TSV and runs on a per-line basis. It works fine with one line, but breaks with multiple lines once a database file comes into play: everything runs up to the mask_primers step, but nothing after it...
My guess is that this step consumes the database file once and therefore doesn't run for the remaining input files
//Mask them primers
process mask_primers {
    tag "${umi_file.baseName}"

    input:
    file(umi_file) from ch_filtered_by_seq_quality_for_primer_Masking_UMI
    file(r2_file) from ch_filtered_by_seq_quality_for_primerMasking_R2
    file(cprimers) from ch_cprimers_fasta 
    file(vprimers) from ch_vprimers_fasta

    output:
    file "${umi_file.baseName}_UMI_R1_primers-pass.fastq" into ch_for_pair_seq_umi_file
    file "${r2_file.baseName}_R2_primers-pass.fastq" into ch_for_pair_seq_r2_file

    script:
    """
    MaskPrimers.py score -s $umi_file -p ${cprimers} --start 8 --mode cut --barcode --outname ${umi_file.baseName}_UMI_R1
    MaskPrimers.py score -s $r2_file -p ${vprimers} --start 0 --mode mask --outname ${r2_file.baseName}_R2
    """
}
micans
@micans
Dec 10 2018 11:59
@apeltzer Which one is the database file? What is the input from the TSV file? What are the multiple input files? What does doesn't run it for multiple input files mean? I'm intrigued.
Alexander Peltzer
@apeltzer
Dec 10 2018 13:19
The input from the TSV is some files and some values - these are then used by several downstream processes on a per-line / per-sample basis, and that part works fine
The problem was (!) that ch_vprimers_fasta was simply set up like this: Channel.fromPath("${params.vprimers}") .ifEmpty{exit 1, "Please specify VPrimers FastA File!"} .set { ch_vprimers_fasta }
So the first of these lines consumed ch_vprimers_fasta and then the process stopped
adding a .collect() to
file(cprimers) from ch_cprimers_fasta 
file(vprimers) from ch_vprimers_fasta
did the trick
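
For the record, a minimal sketch of the fix described above (channel names as in the snippet; collect() turns each single-file queue channel into a value channel, so every task can read the primer files instead of only the first one):

```nextflow
Channel.fromPath("${params.vprimers}")
    .ifEmpty { exit 1, "Please specify VPrimers FastA File!" }
    .set { ch_vprimers_fasta }

process mask_primers {
    input:
    file(umi_file) from ch_filtered_by_seq_quality_for_primer_Masking_UMI
    file(r2_file) from ch_filtered_by_seq_quality_for_primerMasking_R2
    // collect() makes these reusable across all tasks instead of
    // being consumed by the first task only
    file(cprimers) from ch_cprimers_fasta.collect()
    file(vprimers) from ch_vprimers_fasta.collect()

    // output and script blocks unchanged
}
```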

@apeltzer Which one is the database file?

The database files were the cprimers/vprimers.

micans
@micans
Dec 10 2018 13:52
Cool, thanks for explaining. I wondered if the solution was similar to the one for aligner indexes; they also use collect() in our (nf-core, actually) code.
Alexander Peltzer
@apeltzer
Dec 10 2018 14:01
Yes it is the same solution :-)
Although I normally use collect() in a different way, to collect outputs from multiple upstream processes
micans
@micans
Dec 10 2018 15:08

@pditommaso in my collectFile trials I had reason to do this:

process star {
  output: set val('dummy'), file('*.txt') into ch_input
  script: 'for i in {00..09}; do echo "some F $i content" > f$i.txt; done'
}

Channel.from([text: 'some z 1\n'], [text: 'some z 2\n'], [text: 'some z 3\n'])
  .set{ ch_z }

ch_input
  .transpose()
  .map { it[1] }
  .mix(ch_z)
  .collectFile{ ['lostcause.txt', it.text] }
  .set{ ch_merge }

This seems to work (to my surprise): it.text gets both the text from a file and the value from a dictionary with the key text. Is this sanctioned?
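
What makes this work, as far as I can tell (Groovy duck typing, not a documented Nextflow contract): on a java.nio.file.Path, Groovy's extension method getText() reads the file's contents, while on a Map, .text is plain property access, equivalent to map['text']. A small Groovy sketch:

```groovy
// `.text` resolves differently depending on the item's type:
def map = [text: 'some z 1\n']
assert map.text == 'some z 1\n'   // property access -> map['text']

def file = java.nio.file.Files.createTempFile('demo', '.txt')
file.text = 'from a file\n'       // Groovy setText() writes the file
assert file.text == 'from a file\n'  // getText() reads its contents
```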

Paolo Di Tommaso
@pditommaso
Dec 10 2018 17:02
well, it gives me the feeling of mixing apples (files) with oranges ([text: 'some z 1\n']), but you can do it
micans
@micans
Dec 10 2018 17:47
Thanks! I agree completely, I have the same feeling. Right now it kind of makes sense. It happens because sometimes the script section decides to send something to that channel (using a file), and sometimes a Groovy filter function decides to do it. I'm sure it could be done in a better way, but right now it's useful that it works.
If ugly.
Paolo Di Tommaso
@pditommaso
Dec 10 2018 17:48
put a big comment in that code snippet ..
micans
@micans
Dec 10 2018 17:50
Will do. It's related to nextflow-io/nextflow#903
Stephen Kelly
@stevekm
Dec 10 2018 20:44
is there a way to control the number of process threads that Nextflow spawns? And if so, can I control the number of file-copying threads? Especially with regard to copying files to publishDirs.
Rad Suchecki
@rsuchecki
Dec 10 2018 22:00
Not sure about copying, but otherwise maxForks in combination with process cpus.
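
For context, a minimal nextflow.config sketch of those two directives (the values are placeholders, not recommendations):

```groovy
// nextflow.config: limit per-process task parallelism
process {
    cpus = 4        // CPUs requested by each task
    maxForks = 2    // run at most 2 tasks of any one process in parallel
}
```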
Stephen Kelly
@stevekm
Dec 10 2018 22:19
Sorry I am referring to the parent Nextflow system process, not the pipeline tasks. For example, running nextflow run main.nf produces a slew of child processes related to Nextflow. I am trying to find ways to better control them.
Rad Suchecki
@rsuchecki
Dec 10 2018 23:20
I see... not sure, but it looks like, e.g. for file copying, it allocates as many threads as there are available CPU cores
https://github.com/nextflow-io/nextflow/blob/ac2be51dd255151e6ab5056f35811b44b8610af2/src/main/groovy/nextflow/file/FilePorter.groovy#L159-L171
Rad Suchecki
@rsuchecki
Dec 10 2018 23:27
By the way, this may not apply to your use case, but I think the JVM is usually best left alone (that is not an NF-specific comment).