These are chat archives for nextflow-io/nextflow

26th
Jul 2017
Oskar Vidarsson
@oskarvid
Jul 26 2017 06:31
at the moment it 1. does not take the correct pairs and 2. puts "--FASTQ input.x file.sam --FASTQ2 input.y file.bam" in the command; it should of course not put "input.x" or ".y" there at all, and it should also put the correct pairs there to begin with. What am I doing wrong?
Simone Baffelli
@baffelli
Jul 26 2017 06:33
nextflow clean -before failed with ERROR ~ Unexpected error [EOFException]. Is it a known issue?
Oskar Vidarsson
@oskarvid
Jul 26 2017 06:47
I think I fixed it, I changed "file bam/sam from FastqToSam_output/BwaMem_output" to "set pair_id, file(sam/bam) etc"
Oskar Vidarsson
@oskarvid
Jul 26 2017 07:00
One thing that I think is missing is a default setting that identifies which process created which directory in the work directory, it'd make more sense if it looked something like "work/toolname/hash/outputfiles"
the way WDL does it is roughly like this: "work/workflowname/hash/toolname/{input,output}/etc"
there's no way to find a specific execution now, let alone a specific process from a specific execution.
Paolo Di Tommaso
@pditommaso
Jul 26 2017 07:35
Good morning folks. Sorry, the timezone in Spain is different ;)
@baffelli please run nextflow -log log clean -before and open an issue with the produced log file
One thing that I think is missing is a default setting that identifies which process created which directory in the work directory
Have a look at the tag directive, for example here
also you may want the execution report created by NF
Simone Baffelli
@baffelli
Jul 26 2017 07:41
Good morning Paolo. I think we are in the same timezone :grimacing:
Oskar Vidarsson
@oskarvid
Jul 26 2017 07:42
maybe the perception of time is different in Spain ;)
Paolo Di Tommaso
@pditommaso
Jul 26 2017 07:42
officially, not the real one :sunglasses:
Oskar Vidarsson
@oskarvid
Jul 26 2017 07:42
I'll be sure to use the tag feature though, and the performance profiling features are definitely useful!
Simone Baffelli
@baffelli
Jul 26 2017 07:43
Anyway, I think I was using the wrong hash, it works now
Paolo Di Tommaso
@pditommaso
Jul 26 2017 07:43
:+1:
maybe a better error message could help
Simone Baffelli
@baffelli
Jul 26 2017 07:44
Indeed :+1:
Oskar Vidarsson
@oskarvid
Jul 26 2017 07:47
the fix I did didn't work as I thought; it still pairs two different files, e.g. one from lane 1 and one from lane 2
current version: https://github.com/oskarvid/nextflow-GermlineVarCall/blob/master/bwamem.nf
Is it possible to name the output files something like "basename.sam" so that it can pair the input files to mergebamalignments based on the filename?
Paolo Di Tommaso
@pditommaso
Jul 26 2017 07:53
yes, but would not make more sense to use the pair_id instead of the file name?
Oskar Vidarsson
@oskarvid
Jul 26 2017 07:53
sure, but how though?
it's as if it's using two different pair_id's, one for bwamem and one for fastqtosam, because it's pairing the wrong files
Paolo Di Tommaso
@pditommaso
Jul 26 2017 07:54
you mean now ?
Oskar Vidarsson
@oskarvid
Jul 26 2017 07:54
yes
Paolo Di Tommaso
@pditommaso
Jul 26 2017 07:55
I guess NOW it's nondeterministic ..
for example here
you can change to

    output:
    set pair_id, file("mergebam.fastqtosam.bwa.bam") into MergeBamAlignment_output
the channel will produce a tuple as (pair_id, bam_file) ..
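A minimal sketch of the pattern Paolo is describing, with illustrative process, channel, and file names (the real script is in Oskar's repo, and the bwa command line here is only a placeholder):

```groovy
// Illustrative sketch: emitting the pair_id alongside the output file
// keeps the sample identity attached to each item in the channel.
process BwaMem {
    input:
    set pair_id, file(reads) from read_pairs_ch

    output:
    set pair_id, file("${pair_id}.sam") into BwaMem_output

    """
    bwa mem ref.fa ${reads} > ${pair_id}.sam
    """
}
```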
Oskar Vidarsson
@oskarvid
Jul 26 2017 07:57
oh, I didn't think that was necessary
Paolo Di Tommaso
@pditommaso
Jul 26 2017 07:57
well, it depends on what you need to do
however that will also require changing the input declaration here
but I'm still missing which step is getting the wrong inputs
Oskar Vidarsson
@oskarvid
Jul 26 2017 08:01
bwamem and fastqtosam produce the input files for mergebamalignments, and markduplicates merges them all together and marks the duplicates, so markduplicates doesn't need to consider any pairs; it just needs to take all input files and run once
Paolo Di Tommaso
@pditommaso
Jul 26 2017 08:01
ah, I think the critical point is this, right ?
Oskar Vidarsson
@oskarvid
Jul 26 2017 08:02
correct
Paolo Di Tommaso
@pditommaso
Jul 26 2017 08:02
You should use phase to match bam and sam files with the same pair_id
for example
sam_and_bam_ch = BwaMem_output.phase(FastqToSam_output) { left, right -> tuple(left[0], left[1], right[1]) }
then replace this input
with
set pair_id, file(sam), file(bam) from sam_and_bam_ch
does it make sense?
(need to leave now, morning packed with meetings. I will be online in the afternoon)
Oskar Vidarsson
@oskarvid
Jul 26 2017 08:51
It makes sense but it doesn't work, and now, for whatever reason, it doesn't even pair the files correctly in bwamem and fastqtosam :|
Jakob Willforss
@Jakob37
Jul 26 2017 08:57
I had a similar problem, and the code above partially solved it. Adding a map function after the phasing fixed the rest for me:
sam_and_bam_ch = BwaMem_output.phase(FastqToSam_output).map { left, right -> tuple(left[0], left[1], right[1]) }
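Put together, the pattern looks roughly like this (channel and file names are illustrative, and both upstream channels are assumed to emit (pair_id, file) tuples; the picard command line is only a placeholder):

```groovy
// phase() matches items from the two channels by their first element
// (the pair_id), regardless of the order in which they arrive; map()
// then flattens the pair of tuples into a single (id, sam, bam) tuple.
sam_and_bam_ch = BwaMem_output
        .phase(FastqToSam_output)
        .map { left, right -> tuple(left[0], left[1], right[1]) }

process MergeBamAlignment {
    input:
    set pair_id, file(sam), file(bam) from sam_and_bam_ch

    output:
    set pair_id, file("${pair_id}.merged.bam") into MergeBamAlignment_output

    """
    picard MergeBamAlignment --ALIGNED_BAM ${sam} --UNMAPPED_BAM ${bam} \
        --OUTPUT ${pair_id}.merged.bam
    """
}
```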
Dani Soronellas
@dsoronellas
Jul 26 2017 09:09

Hi!
I successfully started a cluster in AWS!! Thanks for your help!
I tried to start a test pipeline ("SciLifeLab/NGI-RNAseq"), and the first time I executed it, it was going fine with this CMD:

~/nextflow run SciLifeLab/NGI-RNAseq -profile docker -w $s3workDir --reads s3readsPath --outdir s3OutDir --fasta s3genomeFa --gtf s3GTFfile

## where s3readsPath is a s3 Dir containing several samples

N E X T F L O W  ~  version 0.25.3-SNAPSHOT
Launching `SciLifeLab/NGI-RNAseq` [fabulous_panini] - revision: 395121e2c5 [master]
WARN: Access to undefined parameter `help` -- Initialise it to a default value eg. `params.help = some_value`
=========================================
 NGI-RNAseq : RNA-Seq Best Practice v1.2
=========================================
[...]
[warm up] executor > ignite
Fetching EC2 prices (it can take a few seconds depending your internet connection) ..
[32/92bc08] Submitted process > fastqc (...)
[b3/43e8a4] Submitted process > trim_galore (...)
[a2/feee1f] Submitted process > makeSTARindex (...)

However, I thought I was using too many samples for a test (although none of them is more than 700Kb), so I pressed CTRL + C to stop the execution and re-ran with only a pair of samples:

~/nextflow run SciLifeLab/NGI-RNAseq -profile docker -w $s3workDir --reads s3ModPath --outdir s3OutDir --fasta s3genomeFa --gtf s3GTFfile

But nextflow threw an error:

N E X T F L O W ~ version 0.25.3-SNAPSHOT
Launching SciLifeLab/NGI-RNAseq [hopeful_joliot] - revision: 395121e2c5 [master]
ERROR ~ Unexpected error [UnsupportedOperationException]

The path to the pair of samples is the same as in the first CMD, except that I specifically look for one sample name within the s3 directory: sample*{1,2}.fastq.gz

I'm not sure whether the error comes from the pipeline or from nextflow itself; where should I look for this?
Many thanks for your time,

Oskar Vidarsson
@oskarvid
Jul 26 2017 09:17
@Jakob37 It only made the input from bwamem translate to the filename; the input from fastqtosam is still called input.n. Do you have any clue as to why?
Oskar Vidarsson
@oskarvid
Jul 26 2017 09:22
and if I change the order to "FastqToSam_output.phase(BwaMem_output).map" the fastqtosam input file gets translated into the file name while the bwamem input file is called input.n
i.e "--ALIGNED_BAM input.6 --UNMAPPED_BAM FastqToSam.bam" with "FastqToSam_output.phase(BwaMem_output).map", but "--ALIGNED_BAM bwamem.sam --UNMAPPED_BAM input.6" with "BwaMem_output.phase(FastqToSam_output).map"
Jakob Willforss
@Jakob37
Jul 26 2017 09:31
Hmm, I got something similar when experimenting with it. Not sure where the 'input.n' comes from, but in my case I don't think I addressed the data structures correctly: having two sets, making the tuple and then extracting the elements. Here is the proof-of-concept example I first got working; not sure whether it addresses your issue: https://github.com/Jakob37/NextFlowExamples/blob/master/synchronizing_channels.nf
I don't have a good overview understanding of this yet though, so I can't help you more there
Oskar Vidarsson
@oskarvid
Jul 26 2017 09:35
thanks anyways, it's good to see code that's similar
Phil Ewels
@ewels
Jul 26 2017 09:35
@dsoronellas - not seen that error before but can take a look if you think that there's an error in the pipeline
On a plane at the moment but will be back online in a few hours
(Pipeline has its own Gitter channel too)
Oskar Vidarsson
@oskarvid
Jul 26 2017 09:41
changing "{ left, right -> tuple(left[0], left[1], right[>0<]) }" to "{ left, right -> tuple(left[0], left[1], right[>1<]) }" seems to have fixed it...
Jakob Willforss
@Jakob37
Jul 26 2017 09:44
Nice!
Oskar Vidarsson
@oskarvid
Jul 26 2017 09:45
thanks again ;)
Jakob Willforss
@Jakob37
Jul 26 2017 09:46
No problem :) And thanks Paolo!
Oskar Vidarsson
@oskarvid
Jul 26 2017 12:19
How do I take a tsv as input, take each tab separated value from one row and paste into the command line with "-L value"? I'm trying to use "def intervals = intervals.collect{ "-L $it" }.join(" ")" and it only puts the "-L" before the first value. Here's the code: https://github.com/oskarvid/nextflow-GermlineVarCall/blob/master/baserecal.nf. The channel at line 47 and def at line 73 are the most relevant to what I'm doing. I also added the tsv file to make it more concrete: https://github.com/oskarvid/nextflow-GermlineVarCall/blob/master/groups.list
I'm aiming to scatter gather the process by using groups of contigs per scatter, each row in the groups.list file will create one scatter process with e.g 1:1+ 2:1+ 3:1+ 4:1+ as the contigs to analyze in that specific scatter, and so on for the other rows.
Paolo Di Tommaso
@pditommaso
Jul 26 2017 12:24
umm
the tsv is this
1:1+    2:1+    3:1+    4:1+
5:1+    6:1+    7:1+    8:1+
9:1+    10:1+    11:1+    12:1+
13:1+    14:1+    15:1+    16:1+
17:1+    18:1+    19:1+    20:1+
21:1+    22:1+    X:1+    Y:1+    MT:1+    GL000207.1:1+    G .. etc
and you want to compose this command line
-L 1:1+   -L  2:1+   -L  3:1+  -L  4:1+
?
Oskar Vidarsson
@oskarvid
Jul 26 2017 12:26
for each line I want to start a separate process
Paolo Di Tommaso
@pditommaso
Jul 26 2017 12:26
ok
Oskar Vidarsson
@oskarvid
Jul 26 2017 12:26
each process will have -L 1:1+ -L 2:1+ etc
Paolo Di Tommaso
@pditommaso
Jul 26 2017 12:27
ok, you are almost there
splitText is enough; you don't need to pass it as a file
you can just provide the string line as input
hence
replace this with
Channel.fromPath(params.contigs)
        .splitText()
       .set { intervals }
and
this with
val intervals
finally this line
the intervals variable holds the tab-separated line, so you will need to split it and prefix the values with -L
    def intervals = intervals.tokenize('\t').collect{ "-L $it" }.join(" ")
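Assembled, the suggestion looks roughly like this (the process body is only a placeholder; `trim()` is added here because splitText keeps each line's trailing newline):

```groovy
// One task per line of groups.list; each tab-separated value on the
// line becomes a repeated -L flag on the command line.
Channel.fromPath(params.contigs)
        .splitText()
        .set { intervals_ch }

process BaseRecalibrator {
    input:
    val line from intervals_ch

    script:
    def intervals = line.trim().tokenize('\t').collect { "-L $it" }.join(' ')
    """
    echo running with: ${intervals}
    """
}
```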
Oskar Vidarsson
@oskarvid
Jul 26 2017 12:32
works like a charm =)
Paolo Di Tommaso
@pditommaso
Jul 26 2017 12:33
you only need file when you need it :)
there's an alternative solution if you are interested
Oskar Vidarsson
@oskarvid
Jul 26 2017 12:34
what's the alternative?
Paolo Di Tommaso
@pditommaso
Jul 26 2017 12:34
instead of using a channel and splitText, use an each repeater, e.g.
declare the intervals input as shown below:
input: 
each intervals from file(params.contigs).readLines()
..
Oskar Vidarsson
@oskarvid
Jul 26 2017 12:36
looks pretty clean
Paolo Di Tommaso
@pditommaso
Jul 26 2017 12:36
yes, just repeat the process execution for each line in the file
be careful with empty lines that may exist in the groups.list file
maybe this is a bit more robust
input: 
each intervals from file(params.contigs).readLines().findAll{ it }
Paolo Di Tommaso
@pditommaso
Jul 26 2017 12:41
i.e. skip all the empty lines
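The same sketch using the each repeater instead of an explicit channel (again with a placeholder process body):

```groovy
// `each` repeats the process once per element of the list;
// findAll{ it } drops empty lines before they become empty tasks.
process BaseRecalibrator {
    input:
    each line from file(params.contigs).readLines().findAll { it }

    script:
    def intervals = line.tokenize('\t').collect { "-L $it" }.join(' ')
    """
    echo running with: ${intervals}
    """
}
```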
Oskar Vidarsson
@oskarvid
Jul 26 2017 12:42
that's really helpful, thanks!
Paolo Di Tommaso
@pditommaso
Jul 26 2017 12:43
welcome
Sergey Venev
@sergpolly
Jul 26 2017 15:09

Hi Paolo,

I hate to bother you again, but I need more insight into nextflow:
1) -resume without storeDir seems to overdo it: -resume restarted successfully completed jobs that had failed previously (retried jobs), and then it restarted everything downstream as well, even though some of the downstream jobs were complete. Is it expected? How can I avoid it? Is manually deleting some work folders OK, or does one have to edit the log file?
2) Does nextflow update the work/hash of a failed job, after that job is successfully retried? Why?
3) It seems like nextflow dismisses some successfully completed jobs as terminated/aborted ones ... Is there a way to make nextflow more resilient on our LSF cluster, maybe by adjusting pollInterval, dumpInterval, queueStatInterval, or exitReadTimeout?

2 more questions unrelated to cluster execution:
4) Can one access the configuration profile flag from inside the pipeline script (to make some processes execute only on a cluster, for example)? How?
5) Is it possible to use 2 publishDir statements per process, to make some files go to one folder and others into a different one?:

    publishDir path: getIntermediateDir('pairsam_run'), pattern: "*.pairsam.gz" 
    publishDir path: getOutDir('stats_run'), pattern: "*.stats", mode:"copy"
Paolo Di Tommaso
@pditommaso
Jul 26 2017 15:10
ummm, so many things :)
5) no, there's a feature request for that . .
Sergey Venev
@sergpolly
Jul 26 2017 15:11
execution related stuff is more important for me now, I can't get it right on our cluster
jobs fail for unknown reasons sometimes - that's just a fact
Paolo Di Tommaso
@pditommaso
Jul 26 2017 15:11
4) you can access the current profile name, is that what you need?
Sergey Venev
@sergpolly
Jul 26 2017 15:11
we have to live with it
yes
4) - yes
Paolo Di Tommaso
@pditommaso
Jul 26 2017 15:12
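A minimal sketch of accessing the current profile name from inside a script via the implicit `workflow` metadata object (the 'cluster' profile name is only a placeholder):

```groovy
// `workflow.profile` exposes the profile the run was launched with
// via the -profile flag.
def onCluster = (workflow.profile == 'cluster')  // 'cluster' is a placeholder name

process clusterOnlyStep {
    // skip this process entirely unless the cluster profile is active
    when:
    onCluster

    """
    echo running on the cluster
    """
}
```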
Sergey Venev
@sergpolly
Jul 26 2017 15:12
got ya!
Paolo Di Tommaso
@pditommaso
Jul 26 2017 15:12
3) this looks strange, what do you mean exactly ?
Sergey Venev
@sergpolly
Jul 26 2017 15:13
it may be related to 2)
Paolo Di Tommaso
@pditommaso
Jul 26 2017 15:13
and what do you mean exactly by "Does nextflow update the work/hash of a failed job"?
Sergey Venev
@sergpolly
Jul 26 2017 15:13
nextflow trace shows a process as a FAILED one, but work/hash is full of all the results
43    ad/4613a9    4681790    map_runs (library:HeLa1 run:lane4 chunk:11)    FAILED    -    2017-07-25 19:20:50.513    59m 37s    17m 25s    596.0%    9.4 GB    9.9 GB    7.7 GB    2.8 GB
Paolo Di Tommaso
@pditommaso
Jul 26 2017 15:14
NF reports a job as failed as long as it terminates with a non-zero exit status, by definition
check in the task workdir
Sergey Venev
@sergpolly
Jul 26 2017 15:15
but if you go to work/ad/4613a9.../
it is full of results
and .exitcode is 0
Paolo Di Tommaso
@pditommaso
Jul 26 2017 15:15
ummmm
so the pipeline stopped ?
Sergey Venev
@sergpolly
Jul 26 2017 15:16
no, this guy map_runs (library:HeLa1 run:lane4 chunk:11) was re-submitted
and completed later
successfully
which is great! but ...
something else broke later on
and after resume
nextflow restarted from map_runs (library:HeLa1 run:lane4 chunk:11)
Paolo Di Tommaso
@pditommaso
Jul 26 2017 15:17
the only other reason is that some outputs were missing
I remember you were reporting some problem with the file system, or am I wrong?
Sergey Venev
@sergpolly
Jul 26 2017 15:18
84    85/ffb814    4681937    map_runs (library:HeLa1 run:lane4 chunk:11)    COMPLETED    0    2017-07-25 20:20:27.630    1h 2m 55s    57m 19s    644.8%    10.1 GB    10.6 GB    18.8 GB    15.1 GB
file system issue was on a different cluster
with SLURM
we were helping collaborators to run distiller on their own
So, ad/4613a9 failed, then 85/ffb814 completed
and now ad/4613a9 is full with results
and .exitcode=0
Paolo Di Tommaso
@pditommaso
Jul 26 2017 15:22
share your log file with pastebin.com
that's .nextflow.log.1
Sergey Venev
@sergpolly
I also have .nextflow.log for the stuff that's running now, after the resume
Do you need that one?
Paolo Di Tommaso
@pditommaso
Jul 26 2017 15:26
wait
can you try to use the latest snapshot by setting NXF_VER, i.e. running NXF_VER=0.25.3-SNAPSHOT nextflow run ..etc
Sergey Venev
@sergpolly
Jul 26 2017 15:35
would NXF_VER=0.25.3-SNAPSHOT make nextflow update itself?
Paolo Di Tommaso
@pditommaso
Jul 26 2017 15:35
yes
Sergey Venev
@sergpolly
Jul 26 2017 15:35
I'll start doing that from now on
Paolo Di Tommaso
@pditommaso
Jul 26 2017 15:36
ok
Sergey Venev
@sergpolly
Jul 26 2017 15:36
would that help to resolve stability issues?
Paolo Di Tommaso
@pditommaso
Jul 26 2017 15:36
there's better logging in this version that will help to troubleshoot the problem
Sergey Venev
@sergpolly
Jul 26 2017 15:36
got ya
I'll come back to these issues (if we have more LSF execution troubles) with new logs
Paolo Di Tommaso
@pditommaso
Jul 26 2017 15:38
yes, then open an issue on GH and upload the log there, please
Sergey Venev
@sergpolly
Jul 26 2017 15:39
Sure, sure - many thanks again!
Paolo Di Tommaso
@pditommaso
Jul 26 2017 15:39
welcome