These are chat archives for nextflow-io/nextflow

9th
Oct 2018
Maxime Vallée
@valleem
Oct 09 2018 05:59
Hello! Report v2 (about the processes that re-run even though they finished properly): changing from toList() to collect() is not working. I am developing further processes downstream. I re-ran the pipeline this morning and all chromosomes that had finished nicely (and have even gone several steps further downstream already) are still re-submitting this early process anyway.
process GenomicsDBImport {

    cpus 1 
    memory '72 GB'
    time '12h'

    tag { chr }

    input:
    each chr from chromosomes_ch
    file (gvcf) from gvcf_ch.collect()
    file (gvcf_idx) from gvcf_idx_ch.collect()

    output:
    set chr, file ("${params.cohort}.${chr}.tar") into gendb_ch

    script:
    """
    ${GATK} GenomicsDBImport --java-options "-Xmx24g -Xms24g" \
    ${gvcf.collect { "-V $it " }.join()} \
    -L ${chr} \
    --genomicsdb-workspace-path ${params.cohort}.${chr}

    tar -cf ${params.cohort}.${chr}.tar ${params.cohort}.${chr}
    """
}
The strangest thing is: if I kill NF and immediately relaunch it, some (but not all) chromosomes are actually skipped:
[4a/9f4aca] Cached process > GenomicsDBImport (chr6)
[5e/73c8ba] Cached process > GenomicsDBImport (chr5)
[f4/06b57b] Cached process > GenomicsDBImport (chr16)
Again, if I kill NF and re-launch, a different set is randomly cached:
[5e/73c8ba] Cached process > GenomicsDBImport (chr5)
[ec/6f5161] Cached process > GenomicsDBImport (chr7)
[2d/301fc5] Cached process > GenomicsDBImport (chr2)
Anthony Underwood
@aunderwo
Oct 09 2018 06:04
@pditommaso I solved the resume issue!! My trimming process took two inputs from two channels produced by two previous processes, including stdout from one of them. I was not using the join operator correctly, so the same two inputs were not always entering the trimming process together.
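Roughly, the kind of pairing I mean looks like this; a minimal sketch with made-up channel names and toy values, not my actual pipeline:

// two channels keyed by a sample id, e.g. one from each upstream process
Channel.from( ['s1', 'reads_s1.fq'], ['s2', 'reads_s2.fq'] ).set { reads_ch }
Channel.from( ['s1', 'phred33'],     ['s2', 'phred64'] ).set { encoding_ch }

// join pairs items that share the same key (the first element of each tuple),
// so each sample's reads always enter the process with that sample's value
reads_ch
    .join(encoding_ch)
    .set { trimming_input_ch }

process trimming {
    input:
    set val(sample_id), val(reads), val(encoding) from trimming_input_ch

    script:
    """
    echo "trimming ${sample_id}: ${reads} (${encoding})"
    """
}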
Maxime Vallée
@valleem
Oct 09 2018 06:05
(I tried again maybe a dozen times, I will not paste everything, but it is always random. Some chromosomes are skipped, some are not, and it is not the same each time.)
Anthony Underwood
@aunderwo
Oct 09 2018 06:06
@valleem - I had a similar problem that was due to the inputs to a channel not being consistently constructed using a join operation (see post above). Using the -dump-hashes argument helped me debug this.
Maxime Vallée
@valleem
Oct 09 2018 06:06
ok I will try, thanks
Maxime Vallée
@valleem
Oct 09 2018 06:16
@aunderwo I cannot try it, as it might be somewhat straining on the login nodes: I keep getting kicked off the cluster when using the -dump-hashes argument.
Anthony Underwood
@aunderwo
Oct 09 2018 06:18
@valleem :( it shouldn't be - in my hands it just prints more verbose output to stdout: the hashes of the inputs for each process.
Maxime Vallée
@valleem
Oct 09 2018 06:23
@aunderwo yup, just received an e-mail from sysadmins, the command is consuming too much memory and it is kicking me out :)
Anthony Underwood
@aunderwo
Oct 09 2018 06:23
wow - maybe the input hashes are huge!
Maxime Vallée
@valleem
Oct 09 2018 06:24
haha yeah! I will try it in a job in interactive mode then
Maxime Vallée
@valleem
Oct 09 2018 07:56

Update: I ran it on a subset to avoid getting kicked off. But I do not understand the outcome...

I ran the pipeline once without -resume to be sure to have a fresh start, and saved the output of -dump-hashes. Everything worked as expected: my processes ran fine and the output was created, no errors, as usual.

I re-ran the pipeline with -resume, again with -dump-hashes, and saved the output in another file. Sure enough, some random processes (chromosomes) were not cached and were re-launched even though they had already completed.

When diving into the logs created by -dump-hashes: of course, for the processes that re-launched, the "cache hash" was different. I wanted to understand why, so I extracted all the hashes from the [java.util.UUID], [java.lang.String], [nextflow.util.ArrayBag]... entries for the processes with a different cache hash (the ones that wrongly re-launched), and they are all identical between the first run and the resumed run... I do not understand how the cache hash is computed.

To be clearer: chromosome 6 re-ran and it should not have (it worked fine on the first launch). Compare the lines created by -dump-hashes in the first-run log and the second-run log:
 diff <(grep "\[GenomicsDBImport (chr6" test_dump2.log -A 24 ) <(grep "\[GenomicsDBImport (chr6" test_dump3.log -A 24 )
1c1
< [GenomicsDBImport (chr6)] cache hash: c7326f430b958e918c5c8a9af397c6f4; mode: STANDARD; entries:
---
> [GenomicsDBImport (chr6)] cache hash: 7908ad3be5fff99f0cfb5780831f60dd; mode: STANDARD; entries:
Maxime Vallée
@valleem
Oct 09 2018 08:01
Of the 24 lines logged about this process, only the first one, the cache hash, is different.
Maxime Vallée
@valleem
Oct 09 2018 08:38

To nail the example down further, I re-launched again; look at chr18. First run is test_dump2.log, second run is test_dump3.log, third is test_dump4.log.
Difference between the first and second run: none. Process cached as expected:

$ diff <(grep "\[GenomicsDBImport (chr18" test_dump2.log -A 24 ) <(grep "\[GenomicsDBImport (chr18" test_dump3.log -A 24 )
$ #no output here

Between the second and third run: the cache hash changed, re-running the process that was fine a few minutes ago:

$ diff <(grep "\[GenomicsDBImport (chr18" test_dump3.log -A 24 ) <(grep "\[GenomicsDBImport (chr18" test_dump4.log -A 24 )
1c1
< [GenomicsDBImport (chr18)] cache hash: 1605a56b7ce677ab1a1c71ff7babdb06; mode: STANDARD; entries:
---
> [GenomicsDBImport (chr18)] cache hash: c828e48d62ce03dd2ca9cc5387cf9132; mode: STANDARD; entries:
Johannes Alneberg
@alneberg
Oct 09 2018 09:22

Hello! I've recently seen something strange on our cluster (SLURM). Nextflow failed with the status

Completed at: Mon Oct 08 13:57:18 CEST 2018
Duration    : 39m 16s
Success     : false
Exit status : null
Error report: Error executing process > 'MapReads (ZZZ-ZZZ)'

Caused by:
  Process `MapReads (ZZZ-ZZZ)` terminated for an unknown reason -- Likely it has been terminated by the external system

But the strange thing is that that job is actually still running. It seems likely this is caused by a temporary glitch in some kind of SLURM status check, so my question is: how often does Nextflow contact the SLURM scheduler to keep track of the jobs it is running? Sometimes our SLURM system is rather slow, so would Nextflow be able to deal with that without these kinds of errors occurring? Thanks!

Paolo Di Tommaso
@pditommaso
Oct 09 2018 10:15
um, weird
@alneberg NF checks that every minute; try also checking the log file, there should be some more info there
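If the scheduler is often slow to respond, the grid executor polling and exit-file timeouts can also be tuned in nextflow.config; a minimal sketch (the values here are arbitrary, check the executor docs for your version):

// nextflow.config
executor {
    queueStatInterval = '2 min'    // how often the scheduler queue status is polled
    exitReadTimeout   = '10 min'   // how long to wait for the task .exitcode file before giving up
}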
Johannes Alneberg
@alneberg
Oct 09 2018 11:32
Here's the bit I deemed most relevant:
Oct-08 13:57:18.812 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exist status for process TaskHandler[jobId: 404765; id: 6; name: MapReads (ZZZ-ZZZ); status: RUNNING; exit: -; error: -; workDir: /lupus/proj/ngi2016004/private/johannes/sarek_run_dirs/XXX/run_dir_KK/work/bf/0283501e0fb93c652e2a06cb14be1e started: 1538999413783; exited: -; ] -- exitStatusReadTimeoutMillis: 270000; delta: 270028
Current queue status:
>   (null)

Content of workDir: /lupus/proj/ngi2016004/private/johannes/sarek_run_dirs/XXX/run_dir_KK/work/bf/0283501e0fb93c652e2a06cb14be1e
null
Oct-08 13:57:18.814 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 404765; id: 6; name: MapReads (ZZZ-ZZZ); status: COMPLETED; exit: -; error: -; workDir: /lupus/proj/ngi2016004/private/johannes/sarek_run_dirs/XXX/run_dir_KK/work/bf/0283501e0fb93c652e2a06cb14be1e started: 1538999413783; exited: -; ]
Maxime Vallée
@valleem
Oct 09 2018 11:32
@pditommaso would you like me to try something I have not yet attempted on this issue?
Paolo Di Tommaso
@pditommaso
Oct 09 2018 11:41
@alneberg yes, it seems that it's unable to fetch the SLURM status, see here
@valleem not sure I understand which issue you are referring to
Maxime Vallée
@valleem
Oct 09 2018 11:46
@pditommaso sorry. I started this morning debugging the issue about processes that re-run after a resume even though they completed nicely. Yesterday you suggested trying to move from .toList() to .collect(). Unfortunately it has not resolved the issue. I tried debugging why the processes re-run with -dump-hashes, but my reading of the logs has not been helpful. If you look at my 10:38 am comment, you will see that the same chromosome is once correctly skipped, then re-run on a third try. I am open to any suggestion to debug it.
Paolo Di Tommaso
@pditommaso
Oct 09 2018 11:48
try to isolate the issue in a reproducible test case, otherwise it's very hard to provide any help
Johannes Alneberg
@alneberg
Oct 09 2018 11:51
Ok, thank you @pditommaso, we'll just sign this one off and blame the cluster for it. Thanks!
Maxime Garcia
@MaxUlysse
Oct 09 2018 11:51
Blame the cluster, that seems like a good solution
Paolo Di Tommaso
@pditommaso
Oct 09 2018 11:52
:)
Johannes Alneberg
@alneberg
Oct 09 2018 11:58
I try not to. But this one seems to me to have been caused by a hiccup.
Paolo Di Tommaso
@pditommaso
Oct 09 2018 12:06
sounds possible, I've just realised that this kind of error is quietly ignored
Johannes Alneberg
@alneberg
Oct 09 2018 14:12
Hmm, what do you mean by quietly ignored? It seems to be enough to shut down the nextflow run, at least?
Paolo Di Tommaso
@pditommaso
Oct 09 2018 14:13
look this line
Johannes Alneberg
@alneberg
Oct 09 2018 14:14
Yes, I saw it, but I have no idea what implications that null gives rise to...
Paolo Di Tommaso
@pditommaso
Oct 09 2018 14:14
that's the null you see in the log
Current queue status:
>   (null)
Johannes Alneberg
@alneberg
Oct 09 2018 14:15
Aha!
Paolo Di Tommaso
@pditommaso
Oct 09 2018 14:15
mainly what I mean is that it should report a warning message when that command fails, i.e. it returns a non-zero exit status
Johannes Alneberg
@alneberg
Oct 09 2018 14:16
Yes, a warning message and potentially also the error message from the underlying failure?
Maxime Vallée
@valleem
Oct 09 2018 14:18
@pditommaso I will try to extract a reproducible subset and make it available to test. I have a question in the meantime: -dump-hashes showed me that there are hashes for everything in a process, and there is also a hash next to the 'tag' of the process (e.g. [GenomicsDBImport (chr18)] cache hash: c828e48d62ce03dd2ca9cc5387cf9132; mode: STANDARD; entries:). How is this hash created?
Yes a warning message and potentially also error message from the underlying error?
yes
Maxime Vallée
@valleem
Oct 09 2018 14:23
OK, so it is not created only from the hashes listed in the output of -dump-hashes then... I will try to investigate what is responsible for changing the task hash with no apparent reason for my process.
Martin Proks
@matq007
Oct 09 2018 15:34

Hi, I have issues accessing data when running the pipeline with Docker. I've made a container that contains the data folders I use in the pipeline, but when I start the pipeline:

nextflow run NGI-RNAfusion/ -with-docker rnafusion-test -profile test,docker --read 'test-data/SRR6129597_{1,2}.fastq.gz' --genome R64-1-1
N E X T F L O W  ~  version 0.32.0
Launching `NGI-RNAfusion/main.nf` [awesome_volta] - revision: a06a716425
ERROR ~ Fasta file not found: /data/sf/genome.fa

I think it's because it is not trying to access the data inside the container but uses my local machine instead. Is there a way to make it use the data in the Docker container instead of the local files?

Paolo Di Tommaso
@pditommaso
Oct 09 2018 15:36
that's not a pattern supported by NF
Martin Proks
@matq007
Oct 09 2018 15:37
ah... what a shame :/
I wanted to make a solo test container with the data to run the pipeline :|
Martin Proks
@matq007
Oct 09 2018 15:54
@pditommaso so the only way would be to use AWS to mount the data files in my case?
Mike Smoot
@mes5k
Oct 09 2018 15:57
@matq007 were you trying to run the pipeline in one container and have the data in a different container? Or was everything in one container?
Martin Proks
@matq007
Oct 09 2018 15:58
@mes5k I've made a custom container with all the tools and genome data that I need for the pipeline. I have to have some special data to run certain tools like FusionCatcher or STAR-Fusion.
micans
@micans
Oct 09 2018 16:02
@pditommaso we are working on porting a pipeline to Nextflow where it seems that the basic output units are directory names. Can we publishDir directories?
Paolo Di Tommaso
@pditommaso
Oct 09 2018 16:22
Yes
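Something along these lines works; a minimal sketch (the directory, process and channel names are made up):

process make_result_dir {
    publishDir 'results', mode: 'copy'   // copies the whole output directory into results/

    output:
    file('my_output_dir') into result_dir_ch

    script:
    """
    mkdir my_output_dir
    echo hello > my_output_dir/report.txt
    """
}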
micans
@micans
Oct 09 2018 16:23
:sweet_potato:
the only thing that began with sweet
Paolo Di Tommaso
@pditommaso
Oct 09 2018 16:24
:joy:
Mike Smoot
@mes5k
Oct 09 2018 16:33

@matq007 the problem is that nextflow itself doesn't run in the container, so any channels you create reading your custom data will fail. You could, however, just read the data directly in the process. Here's an example. I created a simple container that includes a file called /opt/data/input.txt and ran my pipeline in that container:

Channel
    .fromPath('/opt/data/*.txt')
    .set{ homer } // will be empty: the glob is resolved on the host, where /opt/data does not exist


process wont_run {
    input:
    file(f) from homer

    output:
    stdout into hout

    script:
    """
    cat ${f}
    """
}

hout.view()  // won't run

Channel
    .from(1)
    .set{ marge }

process forced_to_run {
    input:
    val(x) from marge

    output:
    stdout into mout

    script:
    """
    # this data exists in the container
    cat /opt/data/*.txt
    """
}

mout.view() // should display contents of text files in /opt/data

The difficulty comes when you actually want to do anything with the files. If you're just referencing them in the process (e.g. a BLAST database) then maybe this is OK. However, if you want to take the data in your container and use it in channels, then you'll basically be copying the data out of the container into your local work directory, which seems less good.

Brad Langhorst
@bwlang
Oct 09 2018 18:59
I’ve run a nextflow workflow on a large set of data and it failed at a late step. Normally I would fix that and re-run to take advantage of the cached jobs. However, these jobs are not being rescued. I can find the files in the work area: .. Any way to get nextflow to find them? It would save a couple of days of computing.
Mike Smoot
@mes5k
Oct 09 2018 19:06
I'm guessing that nextflow sees the files, but thinks something has changed (file modification time or some such). You could try using either deep or lenient caching, but I'm not sure whether making that change would further invalidate the cache...
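For what it's worth, the change would look something like this; a sketch of the config only, and whether it actually rescues the cached jobs depends on why the hashes changed:

// nextflow.config
process {
    cache = 'lenient'   // hash input file path and size, ignoring the modification timestamp
    // cache = 'deep'   // alternatively, hash the actual file content
}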
Tintest
@Tintest
Oct 09 2018 19:48

Hello, I need some help :)

I have a channel that emits multiple items (at least I think so), but my process that takes this channel as input is only executed once.

Here is the channel:

    select_name_ch
      .collect()
      .flatten()
      .combine(genotypegvcfs_ch)
      .set {genotype_combined}

Here is the output (when I subscribe to it with a println):

[2049229374-1203_S7, Y, /bettik/tintest/PROJECTS/Test_nextflow_OAR/work/9e/3afb9d6e05e55afe67f7a837ddafa9/cohort_test_grexome_Y_genotyped.vcf]
[2049229025-1202_S6, Y, /bettik/tintest/PROJECTS/Test_nextflow_OAR/work/9e/3afb9d6e05e55afe67f7a837ddafa9/cohort_test_grexome_Y_genotyped.vcf]
[2049229863-1208_S5, Y, /bettik/tintest/PROJECTS/Test_nextflow_OAR/work/9e/3afb9d6e05e55afe67f7a837ddafa9/cohort_test_grexome_Y_genotyped.vcf]
[2049229374-1203_S7, 13, /bettik/tintest/PROJECTS/Test_nextflow_OAR/work/70/4099e3151af0947285a193733a85e0/cohort_test_grexome_13_genotyped.vcf]
[2049229025-1202_S6, 13, /bettik/tintest/PROJECTS/Test_nextflow_OAR/work/70/4099e3151af0947285a193733a85e0/cohort_test_grexome_13_genotyped.vcf]
[2049229863-1208_S5, 13, /bettik/tintest/PROJECTS/Test_nextflow_OAR/work/70/4099e3151af0947285a193733a85e0/cohort_test_grexome_13_genotyped.vcf]
[2049229374-1203_S7, 22, /bettik/tintest/PROJECTS/Test_nextflow_OAR/work/8b/0797f3436860f25fa21cd152db504d/cohort_test_grexome_22_genotyped.vcf]
[2049229025-1202_S6, 22, /bettik/tintest/PROJECTS/Test_nextflow_OAR/work/8b/0797f3436860f25fa21cd152db504d/cohort_test_grexome_22_genotyped.vcf]
[2049229863-1208_S5, 22, /bettik/tintest/PROJECTS/Test_nextflow_OAR/work/8b/0797f3436860f25fa21cd152db504d/cohort_test_grexome_22_genotyped.vcf]

And here is my process:

        process select_variants_single {
            errorStrategy 'finish'
            echo true

            input:
            set val(input_name), val(chr), file(genotypegvcfs) from genotype_combined
            file (batch_list) from final_name_list_ch
            file (control_list) from control_list_ch

            output:
            set val(input_name),file("${input_name}_${chr}.vcf") into variant_selected_ch

            stdout select_variantsout

            shell:

            '''
            echo SELECT_VARIANTS "!{input_name} chr!{chr}"
            echo !{input_name} > sample.list
            python !{params.subsidiaryDir}/select_variants.py -i !{genotypegvcfs} -p !{input_name} -s sample.list -b !{batch_list} -c !{control_list} -o !{input_name}_!{chr}.vcf -e !{input_name}_!{chr}.log
            '''


        }

        select_variantsout.subscribe { print "$it" }

    }

So I thought it would be executed 9 times, but it isn't. Am I missing something? Thank you :)

Mike Smoot
@mes5k
Oct 09 2018 19:57
The cardinality of genotype_combined needs to match that of final_name_list_ch and control_list_ch. Do both of those have 9 entries? If not, you may need to combine the input channels before they reach the process.
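Something like this, assuming batch_list and control_list are single files shared by every task; a minimal, self-contained sketch with toy values and made-up file names, not your actual channels:

// toy stand-ins for the three channels in the snippet above
Channel
    .from( ['sampleA', 'Y',  'cohort_Y_genotyped.vcf'],
           ['sampleB', '13', 'cohort_13_genotyped.vcf'] )
    .set { genotype_combined }
Channel.fromPath('batch_names.txt').set { final_name_list_ch }     // assumed to hold one file
Channel.fromPath('control_names.txt').set { control_list_ch }      // assumed to hold one file

// fold the two single-item channels into every tuple, so the process reads a
// single input channel that carries one complete entry per task
genotype_combined
    .combine(final_name_list_ch)
    .combine(control_list_ch)
    .set { select_input_ch }

process select_variants_single {
    input:
    set val(input_name), val(chr), val(vcf), file(batch_list), file(control_list) from select_input_ch

    script:
    """
    echo "${input_name} chr${chr} ${vcf} ${batch_list} ${control_list}"
    """
}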
Tintest
@Tintest
Oct 09 2018 19:57
Ok, I'll do that. Thank you :)
Tintest
@Tintest
Oct 09 2018 20:06
Indeed it does the trick :)