These are chat archives for nextflow-io/nextflow

25th Feb 2017
Karin Lagesen
@karinlag
Feb 25 2017 12:23
good morning/afternoon/evening
I am a bit confused regarding how to do a split/gather thing in nextflow
I am running a program (fastqc) on several files, and I get those put into different directories under work
I have a script that does postprocessing on these, which takes a directory as input (it gathers all files of a specific type, in this case .zip, processes them and prints a report)
I understand from reading about publishDir that I can't/shouldn't run the next part of the pipeline, i.e. the script, on the publishDir; I should do that on something within work
but, atm there is nothing within work (except work itself) that could be used as input to my script
Karin Lagesen
@karinlag
Feb 25 2017 12:28
so, any pointers to resolving my confusion would be most welcome :)
Paolo Di Tommaso
@pditommaso
Feb 25 2017 14:25
@karinlag Hi, doesn't the solution proposed here work?
Karin Lagesen
@karinlag
Feb 25 2017 14:32
I got part 1 working, yes
thanks btw :)
Paolo Di Tommaso
@pditommaso
Feb 25 2017 14:33
welcome!
Karin Lagesen
@karinlag
Feb 25 2017 14:33
still trying to figure out part 2 though
Paolo Di Tommaso
@pditommaso
Feb 25 2017 14:33
yes
Karin Lagesen
@karinlag
Feb 25 2017 14:34
I understood from reading about publishDir that I should not use that as input to a process
and besides, I understand that things really should stay and be worked on in work until things are properly done
Paolo Di Tommaso
@pditommaso
Feb 25 2017 14:35
the point of NF is that it manages all intermediate files in the work folders
Karin Lagesen
@karinlag
Feb 25 2017 14:35
exactly (which is most commendable btw)
Paolo Di Tommaso
@pditommaso
Feb 25 2017 14:35
ok
Karin Lagesen
@karinlag
Feb 25 2017 14:36
but, I then have a script that should run things to summarize results once all of the fastqcs are done, and that script gets the directory containing the fastqcs as a command line input
Paolo Di Tommaso
@pditommaso
Feb 25 2017 14:36
so now, one solution is that you just process the fastqc output right after that command
does that make sense ?
Karin Lagesen
@karinlag
Feb 25 2017 14:36
I am not processing each fastqc result individually
Paolo Di Tommaso
@pditommaso
Feb 25 2017 14:36
ahh
right
Karin Lagesen
@karinlag
Feb 25 2017 14:37
what I am doing is summarizing fastqc results, to figure out the status of my read set
I am accessing each of the fastqc result files yes, but that happens inside of the python script I want to call from nf
Paolo Di Tommaso
@pditommaso
Feb 25 2017 14:38
you want to collect all the fastqc runs outputs and post-process them
Karin Lagesen
@karinlag
Feb 25 2017 14:38
yes
and I want to collect them in one directory so I can use that directory as the input for the next process
(all without having them leave work of course)
Paolo Di Tommaso
@pditommaso
Feb 25 2017 14:39
the fastqc outputs have unique file names, right?
Karin Lagesen
@karinlag
Feb 25 2017 14:39
yes
Paolo Di Tommaso
@pditommaso
Feb 25 2017 14:39
ok
Channel
    .fromFilePairs( params.reads )                                             
    .ifEmpty { error "Cannot find any reads matching: ${params.reads}" }  
    .set { read_pairs }

process run_fastqc {
    input:
    set pair_id, file(reads) from read_pairs

    output:
    file '*_fastqc.{zip,html}' into fastqc_results

    """
    mkdir ${pair_id}
    fastqc -q ${reads} -o ${pair_id}
    """
}

process post_process {
  input:
  file 'your-dir/*' from fastqc_results.toSortedList()

  """
  your-script your-dir
  """
}
something like this
Karin Lagesen
@karinlag
Feb 25 2017 14:42
where does your-dir come from? a param?
Paolo Di Tommaso
@pditommaso
Feb 25 2017 14:42
what does the trick is fastqc_results.toSortedList()
it collects all the files in fastqc_results and returns them as a single emission
your-dir/ is any name of your choice
what happens is that it will stage all those files into a subdirectory named your-dir in the task work folder
then you can run the post-process script on that subdirectory
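To picture the staging described above, here is a rough shell sketch of the resulting layout. It does not run Nextflow; it only mimics the outcome by hand, assuming inputs are linked into the task folder under their original names. The temporary directory, the task hashes (aa/1111 and so on) and the sample names are all hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the Nextflow work folder; every name here is made up.
WORK=$(mktemp -d)

# Two upstream fastqc tasks, each with its own task folder and uniquely named output.
mkdir -p "$WORK/aa/1111" "$WORK/bb/2222"
touch "$WORK/aa/1111/sampleA_fastqc.zip" "$WORK/bb/2222/sampleB_fastqc.zip"

# The downstream task gets a fresh folder; the collected files are
# linked under your-dir/ with their original names.
TASK="$WORK/cc/3333"
mkdir -p "$TASK/your-dir"
for f in "$WORK"/aa/1111/*_fastqc.zip "$WORK"/bb/2222/*_fastqc.zip; do
    ln -s "$f" "$TASK/your-dir/$(basename "$f")"
done

# The post-processing script now sees a single directory holding every result.
STAGED=$(ls "$TASK/your-dir" | paste -sd' ' -)
echo "$STAGED"
```

Because the original names are kept, two inputs with the same name would collide, which is why unique fastqc output names matter here.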
Karin Lagesen
@karinlag
Feb 25 2017 14:44
....would that work with directories too, or just files?
Paolo Di Tommaso
@pditommaso
Feb 25 2017 14:45
both
Karin Lagesen
@karinlag
Feb 25 2017 14:45
...if I understand this correctly, I think it will work :)
thanks! will go test and report back
Paolo Di Tommaso
@pditommaso
Feb 25 2017 14:45
cool
wait
I think there could be a problem in the previous task
Karin Lagesen
@karinlag
Feb 25 2017 14:46
ok?
Paolo Di Tommaso
@pditommaso
Feb 25 2017 14:46
I mean fastqc -q ${reads} -o ${pair_id}
it is writing the output into the ${pair_id} directory
Karin Lagesen
@karinlag
Feb 25 2017 14:47
yeah :)
that was why I was asking for directories :)
Paolo Di Tommaso
@pditommaso
Feb 25 2017 14:47
that would work, but the problem is that you are not capturing it with
output: 
file '*_fastqc.{zip,html}' into ..
Karin Lagesen
@karinlag
Feb 25 2017 14:48
I thought I did that when I sent it to fastqc_results?
Paolo Di Tommaso
@pditommaso
Feb 25 2017 14:49
either you capture the directory with
output: 
file "${pair_id}" into ..
Karin Lagesen
@karinlag
Feb 25 2017 14:49
what I have now is:

process run_fastqc {
input:
set pair_id, file(reads) from read_pairs

output:
file "$pair_id" into fastqc_results

"""
mkdir ${pair_id}
fastqc -q ${reads} -o ${pair_id}
"""

}

ugh, sorry for the poor formatting
Paolo Di Tommaso
@pditommaso
Feb 25 2017 14:50
if so it's ok
:+1:
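One thing that may be worth checking with the directory-capturing variant: if each ${pair_id} directory is staged under your-dir/, the zip files probably sit one level down rather than directly in your-dir. The sketch below mimics that layout by hand, without running Nextflow, assuming the directories are linked in under their own names; all paths and sample names are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the work folder; every name here is made up.
WORK=$(mktemp -d)

# Each fastqc task wrote into a directory named after its pair_id,
# and that directory is what the output channel carries.
mkdir -p "$WORK/aa/1111/sampleA" "$WORK/bb/2222/sampleB"
touch "$WORK/aa/1111/sampleA/sampleA_fastqc.zip" "$WORK/bb/2222/sampleB/sampleB_fastqc.zip"

# Linking the collected directories under your-dir/ gives one level of nesting.
TASK="$WORK/cc/3333"
mkdir -p "$TASK/your-dir"
ln -s "$WORK/aa/1111/sampleA" "$TASK/your-dir/sampleA"
ln -s "$WORK/bb/2222/sampleB" "$TASK/your-dir/sampleB"

# A top-level search finds no zips; following the links one level down finds both.
FLAT=$(find "$TASK/your-dir" -maxdepth 1 -name '*_fastqc.zip' | wc -l | tr -d ' ')
NESTED=$(find -L "$TASK/your-dir" -name '*_fastqc.zip' | wc -l | tr -d ' ')
echo "flat=$FLAT nested=$NESTED"
```

So a summary script that only looks for .zip files directly inside the directory it is given might need to search recursively, or the fastqc process could emit the files themselves instead of the directory.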
Karin Lagesen
@karinlag
Feb 25 2017 14:51
:)
Paolo Di Tommaso
@pditommaso
Feb 25 2017 14:51
they are doing something similar here
Karin Lagesen
@karinlag
Feb 25 2017 14:51
awesome to get help like this on a sat afternoon, btw :smile:
Paolo Di Tommaso
@pditommaso
Feb 25 2017 14:51
ahhaha
indeed
enjoy your nextflow geek saturday :)
Karin Lagesen
@karinlag
Feb 25 2017 14:53
will do!
Paolo Di Tommaso
@pditommaso
Feb 25 2017 17:21
@karinlag let's use the chat, much easier
Karin Lagesen
@karinlag
Feb 25 2017 20:35
@pditommaso didn't see this until now. Assuming you are not around. Will be around tomorrow daytime though. And thanks for the great help!