These are chat archives for nextflow-io/nextflow

15th
Apr 2019
Jason Steen
@jasteen
Apr 15 05:09

Having a bit of trouble getting a simple cat to work. I group multiple segment files in a tuple, then need to access the list of files for the cat:

ch_collatedSegments = ch_rawVardictSegments.map{ sample, tbam, nbam, segment -> [sample, tbam, nbam, segment] }.groupTuple(by: [0,1,2])

process catSegments {
    echo true
    input: 
        set sample, tbam, nbam, file(tsv) from ch_collatedSegments.collect()
    output: 
        set sample, tbam, nbam, file("${sample}.collated.vardict.tsv") into ch_rawVardict
    """
    cat ${tsv} > ${sample}.collated.vardict.tsv
    """
}

the structure of ch_collatedSegments is:

[S1, S1_FFPE.consensus.aligned.bam, S1_WB.consensus.aligned.bam, [/scratch/vh83/projects/medha_exomes/test_split/work/e6/4ef975db60de1652ea5138c6caf13c/S1.FFPE_v_WB.seg.6.somatic.vardict.tsv, /scratch/vh83/projects/medha_exomes/test_split/work/41/7a23f5fa61971e722b350661e37015/S1.FFPE_v_WB.seg.3.somatic.vardict.tsv, /scratch/vh83/projects/medha_exomes/test_split/work/4d/fcc4b3d115e1e8988664df8a881d36/S1.FFPE_v_WB.seg.2.somatic.vardict.tsv, /scratch/vh83/projects/medha_exomes/test_split/work/84/6f43ca67210142ab612bb4b170054b/S1.FFPE_v_WB.seg.5.somatic.vardict.tsv, /scratch/vh83/projects/medha_exomes/test_split/work/fb/f7e12fd223dba79f4c20a40b25aa3d/S1.FFPE_v_WB.seg.1.somatic.vardict.tsv, /scratch/vh83/projects/medha_exomes/test_split/work/51/5aa8459486a45303fb358f3a0a378d/S1.FFPE_v_WB.seg.4.somatic.vardict.tsv]]

I've tried a bunch of stuff to get the list of files to cat correctly, but just can't quite make it work. What am I missing?

KennethLim314
@KennethLim314
Apr 15 06:10
Hi everyone, just a quick question. Does anyone have any experience submitting AWS Batch Nextflow jobs with AWS Lambda? Would this be possible?
micans
@micans
Apr 15 09:52
@jasteen not sure, but I think you need something like this:
script:
// join the staged file names into a single space-separated string for cat
myfiles = tsv.collect{ it.toString() }.join(' ')
"""
cat ${myfiles} > ${sample}.collated.vardict.tsv
"""
Ólavur Mortensen
@olavurmortensen
Apr 15 12:33

Nextflow keeps track of all the processes executed in your pipeline. If you modify some parts of your script, only the processes that are actually changed will be re-executed. The execution of the processes that are not changed will be skipped and the cached result used instead.

In my experience, this isn't always true. Nextflow sometimes reruns processes that it should be able to cache, because none of the dependent code or data has changed. This is most obvious in the case where I resume a pipeline literally without changing anything, and it still doesn't cache everything.

Have I misunderstood something, or is this a problem of some sort?

Paolo Di Tommaso
@pditommaso
Apr 15 13:16
this may happen in one of these cases:
  1. a task changes one of its input files/directories
  2. you are using a shared file system and it returns inconsistent time stamps
  3. there's a non-deterministic input in one or more processes (which there should not be); see the sketch below
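for example (a hypothetical sketch), an input like this changes on every run, so the task hash never matches the cache:

// hypothetical sketch: new Date() yields a different value on each run,
// so the task hash changes and -resume cannot reuse the cached result
ch_stamp = Channel.value(new Date().toString())

process annotate {
    input:
        val stamp from ch_stamp
    """
    echo "started at ${stamp}" > stamp.txt
    """
}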
Chelsea Sawyer
@csawye01
Apr 15 13:42
If I want to remove the upper directory structure and just have the file output into the named publishDir, instead of the file and all its subdirectories, do I use something like this, or is there another way?
publishDir path: "${outDir}/fastq/", mode: 'copy',
       saveAs: { filename ->
         if (filename =~ /outs\/fastq_path\/Undetermined_.*\.fastq\.gz/) "/Undetermined/${filename.getName()}"}
Paolo Di Tommaso
@pditommaso
Apr 15 13:50
umm, saveAs: takes a closure that should evaluate an expression returning the path where you want to save the file
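e.g. something like this (an untested sketch; returning null tells publishDir to skip the file):

publishDir path: "${outDir}/fastq/", mode: 'copy',
    saveAs: { filename ->
        // filename is a string; keep only the base name, under Undetermined/
        if( filename ==~ /outs\/fastq_path\/Undetermined_.*\.fastq\.gz/ )
            return "Undetermined/${filename.tokenize('/')[-1]}"
        return null  // don't publish anything else
    }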
Jason Steen
@jasteen
Apr 15 13:51
@micans, that feels close but doesn't work
Paolo Di Tommaso
@pditommaso
Apr 15 13:52
@csawye01 maybe pattern works better for your case, see here
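e.g. (untested sketch):

publishDir path: "${outDir}/fastq/Undetermined", mode: 'copy',
    pattern: 'outs/fastq_path/Undetermined_*.fastq.gz',
    saveAs: { filename -> filename.tokenize('/')[-1] }  // strip the leading directories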
Chelsea Sawyer
@csawye01
Apr 15 14:09
@pditommaso does that only publish the file and not the entire file path directory structure? I previously had it like this, but it still published all of the directories (i.e. Undetermined/outs/fastq_path/Undetermined_sample.fastq.gz instead of Undetermined/Undetermined_sample.fastq.gz):
publishDir path: "${outDir}/fastq/", mode: 'copy',
       saveAs: { filename ->
         if (filename =~ /outs\/fastq_path\/Undetermined_.*\.fastq\.gz/) "/Undetermined/${filename}"}
Jason Steen
@jasteen
Apr 15 14:17
@csawye01, there is no way to stop the work directory being populated, if that's what you are asking. publishDir just copies or moves the final created files into a more useful place for further analysis. Otherwise, just set the name in the process and publishDir will copy it into the folder specified.
micans
@micans
Apr 15 14:38
@jasteen can you make a toy example? I've done a lot of things similar to what you need - a self-contained example would help.
micans
@micans
Apr 15 14:48
(also how exactly does it not work?)
Chelsea Sawyer
@csawye01
Apr 15 14:56
@jasteen No, not the work directory, it's the publish directory I'm having difficulties with.
Jason Steen
@jasteen
Apr 15 15:03

@micans, it gives the following error:

ERROR ~ Error executing process > 'catSegments'

Caused by:
  java.nio.file.ProviderMismatchException

I'm trying to make a toy example, but it's going to take a bit of time to replicate the data structure, since I'm looking at file paths.

micans
@micans
Apr 15 15:03
Hey @jasteen I know what you mean ... files are the roadblock to toy examples.
micans
@micans
Apr 15 15:11
Perhaps, along with my first change, you could also change set sample, tbam, nbam, file(tsv) to set sample, tbam, nbam, tsv?
Jason Steen
@jasteen
Apr 15 15:49
@micans I'm still getting the ProviderMismatchException, and I don't know enough Java to know what it means in this context. I'll have to come up with a toy example tomorrow.
Jason Steen
@jasteen
Apr 15 16:07

@micans, now I'm confused. If I create some random .tsv files and run

ch_files = Channel.fromPath('./*.tsv')
ch_temp = ch_files.map { file -> ['S1', 'S1_FFPE.consensus.aligned.bam', 'S1_WB.consensus.aligned.bam', file] }
ch_temp2 = ch_temp.map{ sample, tbam, nbam, segment -> [sample, tbam, nbam, segment] }.groupTuple(by: [0,1,2])

process catfiles {
  publishDir path: './', mode: 'copy'
  input:
    set sample, tbam, nbam, file(tsv) from ch_temp2
  output:
    set sample, tbam, nbam, file("${sample}.collated.vardict.tsv") into ch_rawVardict
  script:
  myfiles = tsv.collect{ it.toString() }.join(' ')
  """
  cat ${myfiles} > ${sample}.collated.vardict.tsv
  """
}

it works exactly as expected, but with my real data it gives a Java error. Looks like I need to keep troubleshooting.

micans
@micans
Apr 15 16:07
Well, that's something. Insert some .view() calls into your code?
I find that very useful for observing channels.
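e.g. with your toy example:

ch_temp2.view { "after groupTuple: $it" }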
Jason Steen
@jasteen
Apr 15 16:11
Yeah, it's 2am here. I think I might be done for tonight. I'm sure it's something simple I'm missing. I'll try and get back to it tomorrow.
micans
@micans
Apr 15 16:11
Good night :-)
Also visit a work directory upstream of your problematic process (if you have one) and make sure it all looks as you expect.
Venkat Malladi
@vsmalladi
Apr 15 17:01
I am outputting the software version files in each process and want to just grab the first list of these files for a report:
file trimReads_vf from trimReadsVersions.collect().first()
Is that the right way to grab just the first list of files?
Tobias "Tobi" Schraink
@tobsecret
Apr 15 18:02
Sorry, this might be a dumb question, but how can I tell NF via the config file to load some modules before starting the workflow? Use case: I am writing a config file specific to the two NYU clusters that I am using, to be used by nf-core, so that my colleagues can just use the nf-core pipelines by specifying -profile prince. For that to work, squashfs-tools and singularity have to be loaded, so that when pulling the Docker container it can be converted into a Singularity container before the workflow runs.
I know there is workflow.onComplete {} - is there an equivalent like workflow.beforeStart?
Paolo Di Tommaso
@pditommaso
Apr 15 18:14
is there an equivalent that's workflow.beforeStart
nope
maybe just wrap it in a bash script?
Tobias "Tobi" Schraink
@tobsecret
Apr 15 18:16
Thanks for the reply :pray:
kinda defeats the purpose of being able to run nf-core pipelines directly from the command line - it's just two modules, I'll just write a wiki entry in our cluster wiki
Paolo Di Tommaso
@pditommaso
Apr 15 18:17
I understand that
the modules, are they Environment Modules?
Tobias "Tobi" Schraink
@tobsecret
Apr 15 18:17
yup!
Paolo Di Tommaso
@pditommaso
Apr 15 18:17
why not use process.module, then?
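e.g. in your config (module names here are just placeholders):

process {
    module = 'squashfs-tools/4.3:singularity/3.1'
}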
Tobias "Tobi" Schraink
@tobsecret
Apr 15 18:18
I am doing that, but the squashfs-tools executables have to be in the environment for the executor itself, afaik?
The problem is not with running jobs that use Singularity. The problem is that the first time you pull the pipeline, NF also pulls the Docker image and converts it into a Singularity image.
Paolo Di Tommaso
@pditommaso
Apr 15 18:19
I see
there's no special magic for that
Tobias "Tobi" Schraink
@tobsecret
Apr 15 18:20
And I can't for the life of me figure out how to tell the NF executor to load squashfs-tools (idk if it even needs singularity).
Are the config files executable code?
Paolo Di Tommaso
@pditommaso
Apr 15 18:20
Are the config files executable code?
Tobias "Tobi" Schraink
@tobsecret
Apr 15 18:20
I guess I could just dump an exec statement in there if they are
Paolo Di Tommaso
@pditommaso
Apr 15 18:20
it depends ..
I see, maybe, at your risk .. :D
Tobias "Tobi" Schraink
@tobsecret
Apr 15 18:21
executable code in the way that a Python module is executed when you import it
Paolo Di Tommaso
@pditommaso
Apr 15 18:21
you can try, config can execute plain Groovy code
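e.g. (a sketch; the hostname check and module names are just placeholders):

// plain Groovy is evaluated when the config file is parsed
def host = InetAddress.localHost.hostName
if( host.startsWith('prince') ) {
    process.module = 'squashfs-tools/4.3:singularity/3.1'
}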
Tobias "Tobi" Schraink
@tobsecret
Apr 15 18:22
Ooooh, thanks! I'll give that a try and report back. Testing the pipeline on the cluster rn
Paolo Di Tommaso
@pditommaso
Apr 15 18:22
:+1:
Michael L Heuer
@heuermh
Apr 15 19:35
@Fizol I'm not sure what you are hinting at; I assume that jars in NXF_CLASSPATH must be on a disk shared across the cluster. There's also NXF_GRAB, which pulls dependencies from Maven repositories. Are you saying that exec: processes only run on the cluster node that Nextflow is running on?
Maciej Pawlaczyk
@Fizol
Apr 15 19:52
@heuermh Look at the sources, e.g. here ProcessFactoryTest#testSupportType and here ProcessFactory:253; by default an executor supports only the script directive. Check if this is true for your executor; in my case SLURM doesn't support it.
Maciej Pawlaczyk
@Fizol
Apr 15 19:57
Yes, I'm saying that exec: by default will run on the local executor, as far as I know from the code :)
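i.e. a native process like this (a sketch) runs as Groovy code rather than as a grid job:

process sayHello {
    exec:
    println "Hello from ${task.name}"
}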
Maciej Pawlaczyk
@Fizol
Apr 15 20:18
@pditommaso what's the idea behind the in-code Logback configuration? I mean the LoggerHelper setup in Launcher
Jason Steen
@jasteen
Apr 15 22:16
Can anyone explain under what circumstances Nextflow might throw a java.nio.file.ProviderMismatchException error? Is it because I'm trying to access files that haven't been collected properly?