These are chat archives for nextflow-io/nextflow

4th
Oct 2018
micans
@micans
Oct 04 2018 11:26
Can anyone recommend a good Groovy book?
ok joking aside i think i need that recommendation as well
micans
@micans
Oct 04 2018 11:29
:musical_note:
Paolo Di Tommaso
@pditommaso
Oct 04 2018 12:10
@micans the groovy bible is Groovy in Action, second edition
Riccardo Giannico
@giannicorik_twitter
Oct 04 2018 13:23

Hi everyone, I'm new to Nextflow and looking for some help. I hope it's the right place!
If I have this input file list:

$ ls inputdir
Sample1_S1_L001_R1_001.fastq.gz 
Sample1_S1_L001_R2_001.fastq.gz
Sample1_S1_L002_R1_001.fastq.gz 
Sample1_S1_L002_R2_001.fastq.gz
Sample2_S1_L001_R1_001.fastq.gz 
Sample2_S1_L001_R2_001.fastq.gz
Sample2_S1_L002_R1_001.fastq.gz 
Sample2_S1_L002_R2_001.fastq.gz

For each sample I'll need to merge all R1 files together and all R2 files together.
In bash I can do this on the fly (without creating temporary merged files) like this:

samplelist=$(ls -1 inputdir/*.fastq.gz | awk -F/ '{print $NF}' | sed -re "s/_S[0-9]+_L00[0-9]_/ /g" | cut -d ' ' -f1 | sort | uniq)
# now $samplelist is "Sample1 Sample2" 
for sample in ${samplelist} ; do
  bwa mem ${genome} <(ls inputdir/${sample}_S*_R1_*.fastq.gz | xargs zcat) <(ls inputdir/${sample}_S*_R2_*.fastq.gz | xargs zcat) | samtools view -bS - > ${sample}.bam
done

How can I do the same in Nextflow? I know there is a "Channel.fromFilePairs", but in this case I don't have exactly pairs of files. Any idea?

Evan Floden
@evanfloden
Oct 04 2018 13:26
in this case i don't have exactly pairs of files
Are there not four pairs?
You can construct the channel this way (using fromFilePairs) and then use an operator.
@giannicorik_twitter Welcome BTW!
Francesco Strozzi
@fstrozzi
Oct 04 2018 13:40

Ciao @giannicorik_twitter

With Channel.fromFilePairs you can specify a pattern and all the files matching that pattern will be grouped together, even if you have 2 files, 4 files, 6 files etc. (pass size: -1 for that; by default each group holds exactly 2 files).

so in this case you could try something like this:
Channel.fromFilePairs('/my/data/Sample*_*_R{1,2}_*.fastq.gz', size: -1)
micans
@micans
Oct 04 2018 13:46
:+1: @pditommaso
Riccardo Giannico
@giannicorik_twitter
Oct 04 2018 13:57
@evanfloden @fstrozzi thank you :P
uhm, do you mean maybe something like this?
Channel
.fromFilePairs('inputdir/*.fastq.gz') { file -> file.name.replaceAll(/_S\d+_L\d+_R[12]_\d+\.fastq\.gz$/,'') } 
.set { samplelist }

process mapping {
     input:
     sample from samplelist

     """
     bwa mem ${genome} <(ls inputdir/${sample}_S*_R1_*.fastq.gz | xargs zcat) <(ls inputdir/${sample}_S*_R2_*.fastq.gz | xargs zcat) | samtools view -bS - > ${sample}.bam
     """
}
Francesco Strozzi
@fstrozzi
Oct 04 2018 14:05

more like this:

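// size: -1 lets each sample hold any number of files (the default is 2)
samples = Channel.fromFilePairs('/my/data/Sample*_*_R{1,2}_*.fastq.gz', size: -1)

process mapping {
    input:
    set val(sample), file(reads) from samples

    // `genome` is assumed to be defined earlier in the script
    """
    bwa mem ${genome} <(ls *_R1_*.fastq.gz | xargs zcat) <(ls *_R2_*.fastq.gz | xargs zcat) | samtools view -bS - > ${sample}.bam
    """
}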

If you look at the docs, Channel.fromFilePairs gives you back a tuple (declared with set in NF) holding the sample name (i.e. the part the files have in common according to your pattern) and a list with the files themselves. Here sample is a variable (declared as val) and reads is just the list of all the files belonging to that sample. The file() declaration is where the magic happens: NF stages those files automatically into the working directory of each task, so with BWA you only have to refer to the *_R1_* and *_R2_* files.
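
The glob-based merge can be tried outside Nextflow first. What follows is a minimal shell sketch, assuming a scratch directory stands in for the task working directory and tiny one-line gzip files stand in for the reads. gzip -dc is used in place of zcat for portability, and nothing here is Nextflow's own staging code.

```shell
#!/usr/bin/env bash

# Stand-in for a task working directory holding the staged input files.
workdir=$(mktemp -d)
cd "$workdir"

# Create tiny gzipped placeholders (one line each), named like the real reads.
for lane in L001 L002; do
  for read in R1 R2; do
    printf '%s %s\n' "$lane" "$read" | gzip -c > "Sample1_S1_${lane}_${read}_001.fastq.gz"
  done
done

# Merge all lanes per read direction on the fly, as in the bwa command line.
merged_r1=$(ls *_R1_*.fastq.gz | xargs gzip -dc)
merged_r2=$(ls *_R2_*.fastq.gz | xargs gzip -dc)

printf '%s\n' "$merged_r1"
printf '%s\n' "$merged_r2"
```

Both listings come back in sorted order, so R1 and R2 are concatenated lane by lane in the same sequence, which keeps the mates in sync for bwa mem.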

Channel.fromFilePairs also accepts more advanced pattern declarations: you can for instance pass a closure to it (like a lambda in Python) and split the file name further, just to extract the part that corresponds to the sample name you want to keep. But my advice is to start with small and simple examples first.
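
The name-splitting such a closure would do can be checked in the shell before it goes into a pipeline. This is a minimal sketch, assuming the Illumina-style names from the listing above. sample_key and sample_names are hypothetical helpers written for illustration, not Nextflow functions.

```shell
#!/usr/bin/env bash

# Hypothetical helper: derive the sample name by stripping the
# _S<n>_L<lane>_R<read>_<chunk>.fastq.gz suffix from a file name.
sample_key() {
  printf '%s\n' "$1" | sed -E 's/_S[0-9]+_L[0-9]+_R[12]_[0-9]+\.fastq\.gz$//'
}

# Hypothetical helper: print the unique sample names for a list of paths.
sample_names() {
  for f in "$@"; do
    sample_key "${f##*/}"   # drop the directory part before stripping
  done | sort -u
}

sample_names \
  inputdir/Sample1_S1_L001_R1_001.fastq.gz \
  inputdir/Sample1_S1_L002_R2_001.fastq.gz \
  inputdir/Sample2_S1_L001_R1_001.fastq.gz
```

Once the substitution gives the expected names here, the same regular expression is a reasonable starting point for the closure passed to fromFilePairs.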
Riccardo Giannico
@giannicorik_twitter
Oct 04 2018 14:25
Thank you @fstrozzi, from your post I can finally understand the magic of "you only need to call the *_R1_* and *_R2_* files": it works because of the two lists ("sample names" and "files"). Very nice :)
But I still have some doubts about how NF can understand which files belong to which sample without an explicit string substitution like in my example. Maybe I'd better actually try testing something out and then come back to you after that. :P
Francesco Strozzi
@fstrozzi
Oct 04 2018 14:25
yes, try and test this. You can declare just the Channel and then print its content with println()
and there is also a NF console you can use
just to see how the content of the channel changes with the definition of your pattern
cwytko
@cwytko
Oct 04 2018 15:17
@pditommaso thank you!
Tobias "Tobi" Schraink
@tobsecret
Oct 04 2018 20:24
@giannicorik_twitter here is the link for the console: https://www.nextflow.io/blog/2015/introducing-nextflow-console.html
made it much easier for me to test out patterns for my workflow by writing minimal examples like the one you provided