These are chat archives for nextflow-io/nextflow

5th
Dec 2018
Cedric
@Puumanamana
Dec 05 2018 03:13
Thanks for your help. Regarding your suggestion, it would work. So I need to check the number of created files in the output section and store something different depending on that?
The code is here: https://github.com/Puumanamana/16S-pipeline/blob/master/main.nf. But here is the part I am talking about (I removed the unnecessary parts):
process FilterAndTrim {
    input:
        set val(pairId), file(fastq) from INPUT_FASTQ
    output:
        set val(pairId), file("${pairId}*_trimmed.fastq") into FASTQ_TRIMMED_FOR_MODEL

    script:
    """
    #!/usr/bin/env Rscript
    source(...)

    if ( ${params.revRead} == 1 ) {
        filterReads("${pairId}", "${fastq.get(0)}", "${fastq.get(1)}")
    } else {
        filterReads("${pairId}", "${fastq.get(0)}")
    }
    """
}

process LearnErrors {
    input:
        set val(pairId), file(fastq) from FASTQ_TRIMMED_FOR_MODEL
    output:
        set val(pairId), file("${pairId}*.RDS") into ERROR_MODEL

    script:
    """
    #!/usr/bin/env Rscript
    source(...)

    if (${params.revRead} == 1) {
        learnErrorRates("${fastq.get(0)}", "${pairId}1")
        learnErrorRates("${fastq.get(1)}", "${pairId}2")
    } else {
        learnErrorRates("${fastq}", "${pairId}")
    }
    """
}
Alexander Peltzer
@apeltzer
Dec 05 2018 09:28
Is there a way, similar to the automatic docker.temp = 'auto' feature in Docker, to set something similar in Singularity?
Docker runs fine, as the pipeline uses the tmp location on the storage that I force it to use ;-) but in the Singularity case it keeps running until /tmp is full and then we're done :-(
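For reference, a possible workaround sketched as a nextflow.config snippet: Nextflow's singularity.runOptions setting can pass Singularity's -B bind flag to mount a larger host scratch directory over /tmp inside the container. This is a sketch under assumptions, not a tested solution, and the host path is a placeholder:

singularity {
    // assumption: /path/to/big/scratch is a roomy directory on the host;
    // binding it over /tmp keeps the container from filling the host /tmp
    enabled    = true
    runOptions = '-B /path/to/big/scratch:/tmp'
}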
Tintest
@Tintest
Dec 05 2018 10:17

Hello, once again I have a "silly" question:

I have a splitCsv-generated channel, which emits 3 rows, but only the first one is taken into account by my process. Any idea?

Here is some code:

Channel
  .fromPath('/bettik/tintest/PROJECTS/Test_nextflow_OAR/SPARK_ID_Illumina.txt')
  .splitCsv(header:true, sep: '\t')
  .map{ row-> tuple(row.patientno, file(row.tumorbam),file(row.tumorbamindex),row.tumorID, file(row.healthybam),file(row.healthybamindex),row.healthyID) }
  .set { mutect_ID }

Which emits, for example:


[patient2, a , b , c , d, e , f]
[patient3, g , h , i , j , k , l]
[patient1, m , n , o , p , q , r]

Here is the process:


    process mutect2 {
        errorStrategy 'finish'
        publishDir "${params.resultDir}/${patientno}/${chr}/", mode: "copy", pattern: "*.{gz,bam}"
        maxForks params.maxJob
        cpus params.nCpu
        echo true

        input:
        set val(patientno), file(tumorbam), file(tumorbamindex), val(tumorID), file(healthybam), file(healthybamindex), val(healthyID) from mutect_ID
        each chr from chr_Mutect2

        output:
        set val(patientno), file(tumorbam), file(tumorbamindex), val(tumorID), file(healthybam), file(healthybamindex), val(healthyID) into mutect2_ch

        script:
        """
        echo ${patientno}
        """
    }

It works for every chromosome, as requested, but only for patient2, the first row from splitCsv. Any idea which operator I should use to make it work for every row?
Thank you :)

Paolo Di Tommaso
@pditommaso
Dec 05 2018 14:24
@Puumanamana this is a quirk we need to improve; the best solution so far is something like this
@apeltzer in principle it should be possible, however I don't remember how the singularity tmp folder is handled
@Tintest I guess it depends on the chr input, this is a classic mistake, check here
Daniel E Cook
@danielecook
Dec 05 2018 16:26

I have generated two channels that I am trying to combine but I am getting the following error:

ERROR ~ Not a valid `by` index: [0, 1]
Channel.fromPath(['/path/to/*.bam']) \
       .map { parse_basename_id(it) } \
       .take(limit) \
       .into { TO_NORMAL; TO_TUMOR }

NORMAL = TO_NORMAL.filter { it[4] == "GL" }
TUMOR = TO_TUMOR.filter { it[3] ==~ /T[0-9]+/ }

/* Restructure the Tumor Normal Channel

    [Site,
     ID,
     bam group,
     timepoint,
     tumor_n,
     region_n,
     tumor bam path,
     normal_bam path]
*/
TUMOR_NORMAL = TUMOR.combine(NORMAL, by: [0,1]) \
                      .map { it[0..5] + [it[6]] + [it[10]] }
/* parse_basename_id --> Just extracts the basename */

If I add .spread(['test']) to the end of the NORMAL = and TUMOR = lines...it works.

Any idea how to get TUMOR to combine with NORMAL by an index without adding the spread operator? Why is it throwing an error here?
micans
@micans
Dec 05 2018 16:33
what should the result of the combination be? by takes a single index. Are you trying to mix the channels, or are individual tumor and normal items linked in a specific way?
Daniel E Cook
@danielecook
Dec 05 2018 16:37
I'm trying to produce the cartesian product
so Germline x multiple tumor
Oh - important side note: parse_basename_id parses the basename and produces a list of items to match on, e.g. [Identifier_1, Identifier_2, Path]
Tintest
@Tintest
Dec 05 2018 16:39
@pditommaso Thank you for your answer. Following your example, how can I do 1a, 1b, 1c, 2a, 2b, 2c? Ideally I would like to use the each operator on a channel with multiple inputs (mutect_ID), but I don't think that can work.
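For reference, a minimal sketch of one way to get that cartesian product without each, using the combine operator (assuming chr_Mutect2 emits one chromosome per item; mutect_input is a made-up name):

// combine with no `by` argument emits the cartesian product of the
// two channels, i.e. every (patient row, chromosome) pairing
mutect_ID
    .combine(chr_Mutect2)
    .set { mutect_input }    // items look like [patientno, ..., healthyID, chr]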
micans
@micans
Dec 05 2018 16:47
@danielecook I just noticed multiple indexes can be specified with a list of integers, so OK. Still, the error message is trying to tell you something. Is Path (in the tuple with id1, id2) the thing in the original Channel?
This example works ..
#!/usr/bin/env nextflow
a = Channel.from(['a', 'A', 1],['b', 'B', 3],['c', 'C', 5],['d', 'D', 7])
b = Channel.from(['a', 'A', 2],['b', 'B', 4],['c', 'C', 6],['d', 'D', 8])
a.combine(b, by: [0,1]).view()
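Assuming combine matches on the first two elements and appends the remaining elements of each matched pair after the shared key, that view() should print something like the following (line order may vary):

[a, A, 1, 2]
[b, B, 3, 4]
[c, C, 5, 6]
[d, D, 7, 8]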
micans
@micans
Dec 05 2018 16:54
Sorry, that was perhaps not very helpful, but I would insert some view()s in your code to see what's happening.
Daniel E Cook
@danielecook
Dec 05 2018 19:16
@micans - yeah I really can't figure it out. If I leave out the by option and just do a cartesian product I wind up with a nested array (e.g. ['a', 'A', 1, ['b', 'B', 3]])
Daniel E Cook
@danielecook
Dec 05 2018 19:36
Looking into this further - there seems to be an issue with the filter closure: if I filter using different nth elements of an array (e.g. it[3] in one and it[4] in the other) I get the weird nested behavior above, but if I filter using the same nth element, the combine operation works as expected...
Cedric
@Puumanamana
Dec 05 2018 19:49
@pditommaso : Thanks, I'll try that
Cedric
@Puumanamana
Dec 05 2018 21:06
@pditommaso Actually, I found a trick of my own. Instead of using the .get() method, I convert the filenames to strings and handle them in the language used in the process:
#!/usr/bin/env Rscript
fastqs <- c("${fastq.join('","')}")
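Expanded into a full script block, the trick could look something like the sketch below (filterReads, pairId and fastq are from the snippet further up; the length() test for telling single-end from paired-end apart is an assumption, and it presumes fastq is always staged as a list):

script:
"""
#!/usr/bin/env Rscript
source(...)

# interpolate the staged file names into an R character vector
fastqs <- c("${fastq.join('","')}")

# branch on the number of files in R instead of calling fastq.get() in Groovy
if (length(fastqs) == 2) {
    filterReads("${pairId}", fastqs[1], fastqs[2])
} else {
    filterReads("${pairId}", fastqs[1])
}
"""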
Joe Brown
@brwnj
Dec 05 2018 22:47
Is the data that's output to trace available within the workflow? I want to know if I can easily access those data from within onComplete, or whether I need to track the same data some other way.
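As far as I know the per-task trace records are not exposed to the script, but the documented workflow introspection metadata is available inside the handler; a minimal sketch:

workflow.onComplete {
    // workflow-level metadata only, not the per-task trace file fields
    println "Completed at: ${workflow.complete}"
    println "Duration   : ${workflow.duration}"
    println "Success    : ${workflow.success}"
    println "Work dir   : ${workflow.workDir}"
}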
Daniel E Cook
@danielecook
Dec 05 2018 23:41
To follow up on my problem - it turns out that a function I was using to parse data in a channel was converting items to different data types, which was causing issues when using .combine. The solution was to convert back to lists using .toList()
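In other words, the fix might look something like this sketch (parse_basename_id as described above; the assumption is that it returns a tuple-like object whose toList() yields a plain List):

// normalise whatever parse_basename_id returns into a plain List so that
// filter and combine index the elements consistently
Channel.fromPath(['/path/to/*.bam']) \
       .map { parse_basename_id(it).toList() } \
       .take(limit) \
       .into { TO_NORMAL; TO_TUMOR }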
micans
@micans
Dec 05 2018 23:49
@danielecook persistence paid off!