I am trying to run some processes using google-lifesciences. It appears the GATK 4.1.4.1 Docker image from the Broad is causing issues:
Error executing process > 'lens:procd_fqs_to_procd_alns:raw_alns_to_procd_alns:bams_to_base_qual_recal_w_indices:gatk_index_feature_file (Homo_sapiens_assembly38.dbsnp138.vcf.gz)'
Caused by:
Process `lens:procd_fqs_to_procd_alns:raw_alns_to_procd_alns:bams_to_base_qual_recal_w_indices:gatk_index_feature_file (Homo_sapiens_assembly38.dbsnp138.vcf.gz)` terminated with an error exit status (9)
Command executed:
gatk IndexFeatureFile -I Homo_sapiens_assembly38.dbsnp138.vcf.gz
Command exit status:
9
Command output:
(empty)
Command error:
Execution failed: generic::failed_precondition: pulling image: docker pull: running ["docker" "pull" "broadinstitute/gatk:4.1.4.1"]: exit status 1 (standard error: "failed to register layer: Error processing tar file(exit status 1): write /opt/miniconda/lib/python3.6/__pycache__/selectors.cpython-36.pyc: no space left on device\n")
Work dir:
gs://spvensko/work/2a/c230e037330eac63743b8b6e44817a
Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
I've tried upping the lifeSciences.bootDiskSize but that doesn't seem to help. Any tips?
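For anyone comparing notes later, the relevant settings look roughly like this in nextflow.config (a sketch only; the sizes are placeholders, not recommendations, and my understanding that the image pull lands on the VM boot disk may be wrong):

// nextflow.config -- sketch; 100.GB / '200 GB' are placeholder values
google.lifeSciences.bootDiskSize = 100.GB   // VM boot disk, presumably where Docker layers are unpacked

process {
    disk = '200 GB'   // per-task disk attached for the work directory
}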
process COMBINE_METRICS {
    publishDir "${output_dir}", mode: 'copy'

    input:
    path gex_metrics, stageAs: 'gex.txt'
    path adt_metrics, stageAs: 'adt.txt'
    val output_dir

    output:
    path "metrics_summary.csv", emit: metrics_csv

    exec:
    for (one_line in file('gex.txt').readLines()) {
        doThing()
    }
}
I have changed the file declaration in exec between 'gex.txt' and gex_metrics, but in both cases I got a "No such file" error that refers to a nonexistent file under output_dir (for example, in this case it's "${output_dir}/gex.txt"). I'm using DSL2.
Can anyone think of a solution for this? Thanks!
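A minimal sketch of one way this often gets resolved, assuming the root cause is that file('gex.txt') inside an exec: block is resolved against the launch directory rather than the task work directory (trimmed to the relevant parts; the output-writing line is a stand-in for the real merging logic):

process COMBINE_METRICS {
    input:
    path gex_metrics, stageAs: 'gex.txt'

    output:
    path "metrics_summary.csv", emit: metrics_csv

    exec:
    // the input variable is already a resolved Path, so read it directly
    // instead of re-resolving the name with file('gex.txt'); write outputs
    // via task.workDir so they land in the task directory
    def out = task.workDir.resolve('metrics_summary.csv')
    out.text = gex_metrics.readLines().join('\n')   // stand-in for the real logic
}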
Hi all, I'm having trouble getting Nextflow to email me on completion of the pipeline -- specifically, the workflow.onComplete handler is either not working or I have a weird error.
My config file where the workflow.onComplete handler lives has:
workflow.onComplete {
    sendMail(
        to: ${params.recipient},
        subject: 'pipeline execution: ${params.name}'
        '''
        Pipeline execution summary
        ---------------------------
        Completed at: ${workflow.complete}
        Duration    : ${workflow.duration}
        Success     : ${workflow.success}
        workDir     : ${workflow.workDir}
        exit status : ${workflow.exitStatus}
        Error report: ${workflow.errorReport ?: '-'}
        '''
    )
}
And I'm getting either a compilation error at the open brace for workflow.onComplete or an error that says
Unknown method invocation `onComplete` on ConfigObject type -- Did you mean?
compute
Any help you can provide would be greatly appreciated!
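For comparison, a sketch of the handler the way I remember the docs describing it (possibly imperfectly): in a config file the handler has to be *assigned*, the body goes to a named body: argument, and the string needs double quotes so the ${...} placeholders interpolate at completion time:

// nextflow.config -- sketch; note the assignment syntax and double-quoted body
workflow.onComplete = {
    sendMail(
        to: params.recipient,
        subject: "pipeline execution: ${params.name}",
        body: """
        Pipeline execution summary
        ---------------------------
        Completed at: ${workflow.complete}
        Duration    : ${workflow.duration}
        Success     : ${workflow.success}
        workDir     : ${workflow.workDir}
        exit status : ${workflow.exitStatus}
        Error report: ${workflow.errorReport ?: '-'}
        """
    )
}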
Hello @pditommaso and the rest of the nextflow community,
I am a big fan of the approach taken by Nextflow, and I am actively pushing people to adopt it as the principal data flow control language for our genetics community at the major pharmaceutical firm where I work. Our local high-performance UNIX cluster is entirely inadequate for the multitude of algorithms that we need to run, however, so running on a cloud provider is essential. Lately I've run up against a challenge when running in AWS Batch mode, and I'm hoping that someone can suggest a workaround.
In our module-based, DSL2-driven environment I've implemented a structure that I thought would be flexible, using some initial processes to gather (and check the existence of) files, and then making symbolic links to those files wherever they originate, so that I can access them in subsequent steps managed by Nextflow. I've been using an approach along these lines:
process gather_essential_files {
    input:
    val genotype_bim_name
    val genotype_bed_name
    val genotype_fam_name
    val working_dir

    output:
    path("$genotype_bim_name"), optional: true, emit: geno_bim
    path("$genotype_bed_name"), optional: true, emit: geno_bed
    path("$genotype_fam_name"), optional: true, emit: geno_fam

    script:
    """
    ln -s $working_dir/$genotype_bim_name $genotype_bim_name
    ln -s $working_dir/$genotype_bed_name $genotype_bed_name
    ln -s $working_dir/$genotype_fam_name $genotype_fam_name
    """
}
After a process such as this one I can treat all the files in subsequent processes as if they're local, and the logic all seems very clean. I find that this approach runs wonderfully in my local environment, and making those file links is naturally super fast. When I try to run it on AWS Batch, however, I don't get file links; instead, those files each get uploaded to the S3 bucket where Nextflow is managing its process-specific work directories. This uploading/copying approach would make sense to me if we were dealing with S3-based data (given that S3 isn't mounted like a real file system), but we aren't. I'm using AWS FSx to provide access to those files, so they act like any other local files.

Here's my question: is there some parameter/trick that would allow me to actually create symbolic links instead of copying the files every time they are referenced? We use lots of large data files and we want to run in a massively parallel environment, and if we are forced to copy those data files every time a process needs them, the whole approach becomes less viable. I would love a configuration parameter that would let me say "link-instead-of-copying" in the context of the AWS Batch profile. Does anybody have a recommendation for me?
Thanks for any insights, Ben
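One direction that comes up for shared-filesystem setups like this (a sketch, assuming the FSx mount is visible at the same path inside every Batch container; use_genotypes and my_tool are hypothetical names): declare the inputs as val so Nextflow never stages, and therefore never uploads, the files:

process use_genotypes {
    input:
    val genotype_bim   // plain path strings on the shared FSx mount;
    val genotype_bed   // val inputs are not staged, so nothing is
    val genotype_fam   // copied into the S3 work directory

    script:
    """
    my_tool --bim ${genotype_bim} --bed ${genotype_bed} --fam ${genotype_fam}
    """
}

The trade-off is that val inputs opt out of Nextflow's file staging and provenance, so task caching no longer tracks the file contents.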
Hi, I've been trying to port my codebase to DSL2, but I have hit a snag. I have a bunch of paired reads that I put in a channel using fromFilePairs and chunk using splitFastq. I pass this to Trimmomatic, and from that process I emit tuples of the ID and the two paired file chunks that a given instance has trimmed (tuple val(id), file("${id}_trim_1P.fq"), file("${id}_trim_2P.fq")).
I previously was running the following to merge the fq files of forward & reverse chunks:
trimmomatic.out.trimmed_reads.collectFile() { item ->
    [ "${item[0]}_trim_1P.fq", item[1] ]
}.set { collected_trimmed_first_reads }

trimmomatic.out.trimmed_reads.collectFile() { item ->
    [ "${item[0]}_trim_2P.fq", item[2] ]
}.set { collected_trimmed_second_reads }

collected_trimmed_first_reads.merge(collected_trimmed_second_reads)
    .map { item -> tuple(item[0].simpleName, item[0], item[1]) }
    .set { collected_trimmed_reads }
But in DSL2, with merge deprecated (I think actually removed, as it fails to run after the deprecation warning), I'd like to move to the suggested new pattern using join. However, I cannot figure out the pattern that allows join to behave like merge.
I thought maybe to add an integer "key" to each element of the collected file lists produced, but I could only think of doing that in terms of merge too.
The next step in my workflow absolutely requires these chunks to be collected back up, so I can't sidestep this issue.
Thanks in advance for any suggestions on how I can do this better!
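For the record, a sketch of the join-based pattern (assuming the collected file names keep the _trim_1P/_trim_2P suffixes, so a shared sample key can be derived by stripping them; join matches on the first tuple element by default):

collected_trimmed_first_reads
    .map { f -> tuple(f.simpleName.replaceFirst(/_trim_1P$/, ''), f) }
    .join(
        collected_trimmed_second_reads
            .map { f -> tuple(f.simpleName.replaceFirst(/_trim_2P$/, ''), f) }
    )
    .set { collected_trimmed_reads }   // emits: [ id, first_fq, second_fq ]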
Are there alternatives to installing nextflow with curl get.nextflow.io | bash?
Although the curl-pipe-sh practice has become commonplace, it's quite inappropriate for a controlled environment.
(Let's please not debate this here. I'm interested only in answers to the narrow question about whether alternatives exist.)
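(Two alternatives I'm aware of, with the caveat that you should verify they fit your environment: the nextflow package on Bioconda, i.e. conda install -c bioconda nextflow, and the self-contained launcher published with each GitHub release at https://github.com/nextflow-io/nextflow/releases, which you can download, checksum, and mark executable yourself.)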
nextflow console?
Hi people, I stumbled upon an issue about which I am clueless. I have this small R code that I run with Rscript and that uses the sequenza library.
Rscript -e 'test <- sequenza::sequenza.extract("${seqz}", verbose = TRUE);'
The above fails when ${seqz} contains the absolute or relative path to a symbolic link, but if it has the "real" path to the file it works. Does someone have a hypothesis for what may be happening?
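If the library really does choke on symlinks, one knob worth knowing about is staging a physical copy instead of the default symlink (a sketch; sequenza_extract is a hypothetical process name):

process sequenza_extract {
    stageInMode 'copy'   // stage a real copy rather than a symlink

    input:
    path seqz

    script:
    """
    Rscript -e 'test <- sequenza::sequenza.extract("${seqz}", verbose = TRUE);'
    """
}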
env {
    PYTHONNOUSERSITE = 1
    R_PROFILE_USER = "/.Rprofile"
    R_ENVIRON_USER = "/.Renviron"
}
I have multiple parallel workers that each consume messages and launch a pipeline with nextflow run. The problem is, if the number of messages is high, these parallel workers initialize nextflow almost at the same time. This causes the following error to occur with high frequency: Can't lock file: /home/myhomedir/.nextflow/history -- Nextflow needs to run in a file system that supports file locks. I suspect this happens because when one nextflow process puts a lock on $HOME/.nextflow/history, another nextflow process tries to lock the same file before it is released by the former. Is this something intended? Any ideas how to properly handle this without dirty workarounds?
I checked the source code; it is raised from here in modules/nextflow/src/main/groovy/nextflow/util/HistoryFile.groovy:
try {
    while( true ) {
        lock = fos.getChannel().tryLock()
        if( lock ) break
        if( System.currentTimeMillis() - ts < 1_000 )
            sleep rnd.nextInt(75)
        else {
            error = new IllegalStateException("Can't lock file: ${this.absolutePath} -- Nextflow needs to run in a file system that supports file locks")
            break
        }
    }
    if( lock ) {
        return action.call()
    }
}
The problem is, it tries to lock for a sec (if I'm reading Java correctly) and then quits if it can't. Am I not supposed to run multiple nextflow processes in parallel?
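(One workaround that might apply, assuming the contention really is on the history file under the launch directory's .nextflow folder: start each parallel run from its own directory, e.g. cd run_N && nextflow run ..., so every instance gets a separate .nextflow/history to lock.)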
I've got a curious issue -- I am running Nextflow on a cluster that I typically do not use. Many of my processes are getting errors like the following:
[b8/551435] NOTE: Process `lens:manifest_to_dna_procd_fqs:trim_galore (VanAllen_antiCTLA4_2015/p017/ad-770067)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (1)
Yet, if I go to the work directory, the process is clearly still running:
(base) [spvensko@longleaf-login4 5cb96b7bf20f816c13f22f8f0e3b08]$ realpath .
/pine/scr/s/p/spvensko/work/bd/5cb96b7bf20f816c13f22f8f0e3b08
(base) [spvensko@longleaf-login4 5cb96b7bf20f816c13f22f8f0e3b08]$ ls -lhdrt *
lrwxrwxrwx 1 spvensko users 75 Oct 21 13:11 VanAllen_antiCTLA4_2015-p013-nd-780020_1.fastq.gz -> /pine/scr/s/p/spvensko/fastqs/VanAllen_antiCTLA4_2015/SRR2780020_1.fastq.gz
lrwxrwxrwx 1 spvensko users 75 Oct 21 13:11 VanAllen_antiCTLA4_2015-p013-nd-780020_2.fastq.gz -> /pine/scr/s/p/spvensko/fastqs/VanAllen_antiCTLA4_2015/SRR2780020_2.fastq.gz
-rw-r--r-- 1 spvensko users 3.8G Oct 21 13:23 VanAllen_antiCTLA4_2015-p013-nd-780020_1_trimmed.fq.gz
-rw-r--r-- 1 spvensko users 3.3K Oct 21 13:23 VanAllen_antiCTLA4_2015-p013-nd-780020_1.fastq.gz_trimming_report.txt
-rw-r--r-- 1 spvensko users 630 Oct 21 13:23 VanAllen_antiCTLA4_2015-p013-nd-780020_2.fastq.gz_trimming_report.txt
-rw-r--r-- 1 spvensko users 2.0G Oct 21 13:30 VanAllen_antiCTLA4_2015-p013-nd-780020_2_trimmed.fq.gz
Anyone seen this behavior before?